Properties of Consensus Methods for Inferring Species Trees from Gene Trees 
Abstract

Consensus methods provide a useful strategy for combining information from a collection of gene trees. An important 
application of consensus methods is to combine gene trees to estimate a species tree. To investigate the theoretical 
properties of consensus trees that would be obtained from large numbers of loci evolving according to a basic 
evolutionary model, we construct consensus trees from independent gene trees that occur in proportion to gene tree 
probabilities derived from coalescent theory. We consider majority-rule, rooted triple (R*), and greedy consensus 
trees constructed from known gene trees, both in the asymptotic case as numbers of gene trees approach infinity and 
for finite numbers of genes. Our results show that for some combinations of species tree branch lengths, increasing the 
number of independent loci can make the majority-rule consensus tree more likely to be at least partially unresolved 
and the greedy consensus tree less likely to match the species tree. However, the probability that the R* consensus 
tree has the species tree topology approaches 1 as the number of gene trees approaches infinity. Although the greedy 
consensus algorithm can be the quickest to converge on the correct species tree when increasing the number of gene 
trees, it can also be positively misleading. The majority-rule consensus tree is not a misleading estimator of the 
species tree topology, and the R* consensus tree is a statistically consistent estimator of the species tree topology. 
Our results therefore suggest a method for using multiple loci to infer the species tree topology, even when it is 
discordant with the most likely gene tree. 



The goal of many phylogenetic and phylogeographic
studies is not the estimation of the individual
gene trees, but rather the estimation of
the species-level phylogeny or population history
(Felsenstein, 1988; Takahata, 1989; Maddison, 1997;
Nei and Kumar, 2000). Among methods that have
been used to estimate species trees from data on
multiple loci, a popular approach has been to make
use of sequences concatenated across the loci. In
essence, this approach assumes that all loci have
the same gene tree, whose estimate is also used as
the estimated species tree. Because gene trees vary
both locally and across broad regions of organismal
genomes (Chen and Li, 2001; Pollard et al., 2006;
Hobolth et al., 2007), sequence data from multiple
genes are expected to be the result of heterogeneous
processes. Multilocus data can be regarded as mixtures
generated from different branch lengths and
mutation rates on gene trees as well as from different
gene tree topologies that may arise from sources
such as incomplete lineage sorting or hybridization.

As a result of these various sources of heterogeneity,
concatenation can perform poorly when
sequences are analyzed as if they come from a
single model. Inferences may be inconsistent
(Kolaczkowski and Thornton) or the mixture
generating the sequences might not be identifiable
(Matsen and Steel, 2007) even when sites
are generated from the same topology. Similarly,
when sites are generated from different topologies
but under the same mutation model, analyzing
the concatenated data can lead to misleading
inferences (Mossel and Vigoda, 2005; Edwards et al.,
2007; Kubatko and Degnan, 2007). It is therefore
useful to examine the behavior of other approaches
in situations with a high level of gene tree discordance.

One approach for estimating species trees that
does not assume all loci reflect the same underlying
gene tree is to construct consensus trees. However, relatively
little is known about how consensus algorithms are 
expected to perform when applied to trees from mul- 
tiple loci. We explore the properties of three con- 
sensus algorithms applied to independent loci when 
gene tree discordance is the result of incomplete lin- 
eage sorting. In particular, we ask the question: as 
the number of gene trees considered from different 
loci increases, what is the probability that the con- 
sensus tree matches the species tree topology? 

We focus on majority-rule, rooted triple (R*), and 
greedy consensus trees. A survey of these and other 
consensus methods can be found in Bryant (2003). 
Majority-rule consensus trees consist of those clades 
that occur more than 50% of the time in a collec- 
tion of trees. (For simplicity, we always use 50% 
as the cut-off when referring to majority-rule con- 
sensus, although any greater proportion could be 
used instead.) The R* consensus tree is the most 
resolved tree that is compatible with a set of three- 
taxon statements (rooted triples), each of which is 
the rooted triple occurring most often (for a given 
set of three taxa) in a collection of trees on the same 
set of taxa. A tree containing these rooted triples 
can be constructed using an algorithm such as the
method in (Bryant and Berry, 2001). We use the
convention that if the set of rooted triples is incom- 
patible or if there is a tie for the most frequently 
occurring rooted triple, the R* tree is declared unre- 
solved or partially unresolved for those taxa causing 
the incompatibility. Greedy consensus trees are constructed
by sequentially adding one clade at a time,
the most frequently occurring clade that is compatible
with clades already included in the greedy consensus
tree (breaking ties randomly). Greedy consensus
trees are also sometimes called "Majority rule
extended" (Felsenstein, 1993), or simply "Majority-rule"
(Baum, 2007), and the greedy consensus algorithm
is implemented in PHYLIP (Felsenstein, 1993)
and PAUP* (Swofford, 1998). For a given set of input
trees, the greedy and R* consensus trees are always
refinements of the majority-rule tree (Bryant, 2003)
but can refine the majority-rule tree in different
ways.
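
The majority-rule and greedy procedures described above can be sketched in a few lines of Python. This is only an illustration: representing each rooted gene tree as a set of clades (each clade a frozenset of taxon labels), the function names, and the seven toy input trees are all assumptions made for the example, and nothing here reflects the PHYLIP or PAUP* implementations. Ties in the greedy step are broken by sort order rather than randomly so that the result is reproducible.

```python
from collections import Counter

def clade_counts(trees):
    """Count how many input trees contain each clade."""
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))
    return counts

def compatible(c1, c2):
    """Two clades are compatible if one contains the other or they are disjoint."""
    return c1 <= c2 or c2 <= c1 or not (c1 & c2)

def majority_rule(trees):
    """Clades found in strictly more than half of the input trees."""
    counts = clade_counts(trees)
    return {c for c, k in counts.items() if 2 * k > len(trees)}

def greedy(trees):
    """Add clades in order of decreasing frequency, skipping incompatible ones."""
    counts = clade_counts(trees)
    # Deterministic tie-breaking (an assumption; the text breaks ties randomly).
    ranked = sorted(counts, key=lambda c: (-counts[c], sorted(c)))
    chosen = []
    for c in ranked:
        if all(compatible(c, d) for d in chosen):
            chosen.append(c)
    return set(chosen)

def tree(*clades):
    """Hypothetical helper: build a clade set from strings such as 'AB'."""
    return {frozenset(c) for c in clades}

# Seven toy four-taxon gene trees, listed by their non-trivial clades.
trees = [
    tree("AB", "ABC"), tree("AB", "ABC"),
    tree("AB", "CD"), tree("AB", "CD"),
    tree("BC", "ABC"), tree("AC", "ACD"), tree("AD", "ABD"),
]

mr = majority_rule(trees)
gr = greedy(trees)
print(sorted("".join(sorted(c)) for c in mr))
print(sorted("".join(sorted(c)) for c in gr))
```

In this toy input only {AB} reaches a majority, so the majority-rule tree is ((AB)CD), while the greedy tree adds {ABC} and is therefore one of its refinements.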

The three consensus methods considered in this
paper exhibit different behaviors when the number
of genes increases. We find that in evolutionary
models that generate sufficient gene tree discordance,
adding genes can increase the probability
that the majority-rule consensus tree is unresolved.
However, this unresolved tree is compatible
with the species tree in the sense that one of
its refinements has the species tree topology. We
call sets of branch lengths leading to this lack of
resolution unresolved zones. Also, as the number 
of independent, known gene trees increases, the R* 
tree becomes fully resolved and matches the species 
tree. However, greedy consensus trees, which are al- 
ways resolved, can be misleading in the sense that 
adding more genes can be more likely to result in 
a tree that does not match the species tree. We 
use the term too-greedy zone to denote the set of 
species tree branch lengths for which greedy con- 
sensus trees constructed from infinitely many loci 
disagree with the species tree. This is analogous
to the anomaly zone (Degnan and Rosenberg, 2006),
the set of branch lengths for which the most prob- 
able gene tree does not match the species tree. In 
the case of four-taxon asymmetric species trees, the 
too-greedy zone is a subset of the anomaly zone. 

In this paper, we first show some four-taxon ex- 
amples of consensus trees when the number of loci 
approaches infinity but branch lengths in the species 
tree vary. This is followed by derivations for four- 
taxon trees of the unresolved zones for majority-rule 
consensus trees and the too-greedy zone for greedy 
consensus trees. The main results of the paper (The- 
orems 1, 3, 4, and 5) give different results for the 
limiting behavior of the three consensus methods 
used. Finally, we consider the same consensus meth- 
ods with finitely many loci sampled, including some 
examples with three and four taxa. 

The Multispecies Coalescent 

We use the term "multispecies coalescent" for the 
model in which coalescent processes occur in each 
branch of a species tree and for which all possible 
coalescent events within a branch are equally likely. 
This is the model that has previously been used in
a number of studies (Tajima, 1983; Pamilo and Nei, 1988;
Takahata, 1989; Rosenberg, 2002; Degnan and Salter, 2005).
This model assumes that the genes from the different 
species are orthologous, that there is no recombina- 
tion or horizontal gene transfer within the genes of 
interest, and that natural selection is not acting on 
these genes. This model also assumes that popula- 
tion sizes are constant within species tree branches 
(although not necessarily across branches) and that
populations are panmictic. 

Definitions 

Unless otherwise noted, we use "gene tree" to refer 
to a gene tree topology, and "species tree" to refer to 
a species tree topology with internal branch lengths 
specified. Because two or more lineages in a popula- 
tion are needed for a coalescence to occur, lengths of 
external branches (those leading to tips of the species 
tree) do not affect probabilities of gene tree topolo- 
gies when only one lineage is considered per species. 
Branch lengths on species trees are measured in co- 
alescent units, the number of generations divided by 
the effective population size (twice the effective
population size for diploids (Hein et al., 2005)).
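
As a small illustration of this unit conversion, the following sketch turns a branch length in generations into coalescent units. The function name, its parameters, and the numbers in the usage lines are hypothetical and chosen only for the example.

```python
def coalescent_units(generations, effective_size, diploid=True):
    """Convert a branch length in generations to coalescent units.

    Divides by the effective population size, or by twice the effective
    population size for diploids, as described in the text.
    """
    if effective_size <= 0:
        raise ValueError("effective_size must be positive")
    scale = 2 * effective_size if diploid else effective_size
    return generations / scale

# Hypothetical values: 10,000 generations with an effective size of 10,000.
diploid_length = coalescent_units(10_000, 10_000)
haploid_length = coalescent_units(10_000, 10_000, diploid=False)
print(diploid_length, haploid_length)
```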

Nodes on gene trees correspond to coalescent 
events. For example if a node on a gene tree is the 
root of the subtree ((AB)C), this node corresponds 
to the coalescent event that joins the lineage ances- 
tral to (AB) with the lineage ancestral to C, where 
(AB) itself represents the coalesced lineage combin- 
ing the lineages from taxa A and B. We say that 
(AB) is a lineage "containing" A and B. We addi- 
tionally say that two taxa "join" or "are joined" on a 
branch b if the lineages (i.e. clades) containing those 
taxa coalesce on branch b. For example, if (AB) 
and C coalesce on branch 3, then A and C "join" 
on branch 3. Clades with only two taxa (on either 
species or gene trees) are called cherries. We use 
the same letter (such as A, B, etc.) to refer to both 
a taxon and to the gene lineage sampled from that 
taxon. 

We use the notation (AB)C for the three-taxon 
statement (rooted triple) that the most recent com- 
mon ancestor (MRCA) of gene lineages A and B on 
a species tree is not an ancestor of C. This nota- 
tion is similar to the notation for a three-taxon tree 
but does not have the outer set of parentheses. If a 
given species tree (with topology and internal branch
lengths specified) is $\sigma$, then $P_\sigma[\cdot]$ indicates
probabilities of events for gene lineages when $\sigma$ is the species
tree. For example, $P_\sigma[(AB)C]$ and $P_\sigma[((AB)C)]$ are
used to indicate the probabilities of the rooted triple
(AB)C and the gene tree ((AB)C), respectively. The
expression $P_\sigma[\{ABC\}]$ is used to denote the probability
that {ABC} is a clade on the gene tree.

Because we frequently refer to time looking backwards
starting from the present, we use "before" and
"first" to mean "more recently" and "most recently",
and we use "more anciently than" in the usual sense
of looking at time from the past to the present. 

Asymptotic Consensus Trees 

Consensus trees are used to summarize a set of 
trees defined on the same set of taxa. A consen- 
sus algorithm takes the trees as inputs, so that the 
method of producing the input trees is not part of 
the consensus algorithm. Typically the trees sum- 
marized might be estimated trees such as those that 
are obtained from separate genes, different models, 
or different bootstrap samples. In all of these cases, 
the consensus tree is a function of some data set and
is therefore a statistic (Casella and Berger).

Using gene tree probability distributions, we can 
also compute the consensus tree that would be re- 
turned in the limit as the number of gene trees 
approaches infinity. This calculation assumes that 
these gene trees are correctly estimated, indepen- 
dent, and generated by the multispecies coalescent 
model. In this setting, the proportion of occur- 
rences for a gene tree topology asymptotically ap- 
proaches its probability under the multispecies co- 
alescent model as the sample size (the number of 
independent loci) approaches infinity. 

Consensus trees obtained from these asymptotic 
proportions are not functions of data, and are there- 
fore not statistics. Instead they are properties solely 
of gene tree probability distributions. These in turn 
are functions of the species tree, which we can consider
to be a parameter for a gene tree distribution
(Degnan and Salter, 2005). Intuitively, we can also
think of a consensus tree computed from gene tree 
probabilities under the multispecies coalescent as the 
consensus tree that would be obtained from an infi- 
nite number of independent, correctly inferred gene 
trees. 

We define an asymptotic consensus tree for a 
species tree to be the tree topology that would be ob- 
tained if a consensus algorithm had input gene trees 
in proportion to their probabilities (under the mul- 
tispecies coalescent model). We note that under the 
multispecies coalescent model that we are consider- 
ing, every gene tree topology has positive probabil- 
ity given any species tree, and therefore every gene 
tree is included in the consensus algorithm. Consequently,
methods such as Adams and strict consensus
(Bryant, 2003; Felsenstein, 2004), which preserve
information shared by all input trees, result in
star trees when probabilities under the multispecies
coalescent are used. As more gene trees are sampled,
the probability approaches zero that there is
strict agreement for the relationships for any subset
of taxa. We therefore focus on three consensus
algorithms that do not require strict agreement.

For each of these algorithms, we first characterize 
the asymptotic consensus trees for three and four 
taxa. We also prove general theorems about these 
trees for arbitrary numbers of taxa. We then return 
to the three- and four-taxon cases and consider the 
approach to the asymptotic consensus tree based on 
finitely many loci. 

The majority-rule asymptotic consensus tree 
(MACT) can be determined by listing the proba- 
bility of monophyly for each subset of taxa. If a 
subset of taxa appears on the list with probability 
greater than 1/2, then that group is contained in the 
MACT. This is the same method traditionally used 
to determine majority-rule consensus trees, but here 
we use theoretical probabilities rather than observed 
proportions. 

Similarly, the R* asymptotic consensus tree
(RACT) can be determined by calculating the probability
of each of the three possible rooted triples for
each of the $\binom{n}{3}$ subsets of three taxa. The RACT
then consists of those rooted triples that have the
highest probability for each subset of three taxa.
For any three taxa and a strictly bifurcating species
tree, the rooted triple corresponding to the species
tree is always the most probable (see Lemma 2
below); that is, there are no ties. The set of rooted
triples for all $\binom{n}{3}$ subsets of three taxa uniquely
identifies the species tree (Steel, 1992, Prop. 4); thus
the RACT is always uniquely identified and fully
resolved under the multispecies coalescent model.
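
The first step of the R* procedure, choosing the most frequent rooted triple for each set of three taxa, can be sketched as follows. The nested-tuple tree encoding, the function names, and the four toy gene trees are assumptions made for this example. The second step, assembling a tree from the selected triples (for instance with the Bryant and Berry method mentioned earlier), is not shown.

```python
from collections import Counter
from itertools import combinations

def leaves(node):
    """Taxon labels below a node (a label string or a tuple of child nodes)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clades(node):
    """Leaf sets of all internal nodes of a nested-tuple tree."""
    if isinstance(node, str):
        return []
    found = [leaves(node)]
    for child in node:
        found.extend(clades(child))
    return found

def rooted_triple(tree_clades, trio):
    """Return the pair (a, b) such that (ab)c holds on the tree, else None."""
    for pair in combinations(sorted(trio), 2):
        outsider = (set(trio) - set(pair)).pop()
        # (ab)c holds if some clade contains a and b but not c.
        if any(set(pair) <= c and outsider not in c for c in tree_clades):
            return pair
    return None

def most_frequent_triples(trees):
    """Most frequent rooted triple for every trio of taxa; None on a tie."""
    taxa = sorted(leaves(trees[0]))
    clade_lists = [clades(t) for t in trees]
    best = {}
    for trio in combinations(taxa, 3):
        counts = Counter(rooted_triple(cl, trio) for cl in clade_lists)
        counts.pop(None, None)
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            best[trio] = None  # tie: left unresolved, as in the text's convention
        else:
            best[trio] = ranked[0][0] if ranked else None
    return best

# Four toy gene trees on taxa A, B, C, D.
gene_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
    ((("A", "C"), "B"), "D"),
]
triples = most_frequent_triples(gene_trees)
for trio, pair in triples.items():
    print(trio, pair)
```

For these toy trees the selected triples are (AB)C, (AB)D, (AC)D, and (BC)D, which together are compatible only with (((AB)C)D).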

The greedy asymptotic consensus tree (GACT) for
n taxa can be obtained by ranking probabilities of
the $2^n - n - 1$ clades with two or more taxa. The
most probable clade is incorporated into the consen- 
sus tree, and then the list of clade probabilities is 
updated by removing any clades incompatible with 
those already in the tree. This process is repeated 
until the tree is fully resolved, randomly picking 
clades in the case of ties. 

The three types of asymptotic consensus trees — 
MACT, RACT, and GACT— are purely mathemat- 
ical functions of gene tree probabilities. They are 
therefore properties of species trees. Consensus trees 
constructed from finitely many loci under different 
consensus algorithms are random variables, and are
increasingly likely to match their asymptotic counterparts
as the number of loci approaches infinity.

Figure 1: Four-taxon species trees with internal
branch lengths x and y.

Examples 

Examples which illustrate the construction of
asymptotic consensus trees for the three methods in
this paper are shown in Table 1, which lists probabilities
of each gene tree for four taxa, for several sets
of branch lengths on the species tree in Figure 1A.
Also listed are probabilities for two- and three-taxon 
clades, and probabilities for the 12 rooted triples. 
For four taxa, there are six possible cherries and 
four possible three-taxon monophyletic groups. Note 
that because the cherries are not mutually exclusive, 
their probabilities sum to more than one. Also, be- 
cause it is possible for a tree to not have any three- 
taxon monophyletic groups, the sum of the proba- 
bilities for subsets of three taxa is less than one. 

For the examples in Table 1, majority-rule con- 
sensus returns each of the four possible trees illus- 
trated in Figure 2A. Greedy consensus returns the 
matching tree for all examples in the table, except 
when (x,y) = (0.05,0.05), in which case it returns 
((AB)(CD)). This topology is also the most probable 
gene tree for those branch lengths. R* consensus is
the only consensus method considered which returns
the matching tree for all branch lengths used. From
Theorem 3 this result is not limited to the example
chosen, but applies to any branch lengths and any
binary species tree.

As an example from the table, we see that if 
the species tree has topology (((AB)C)D) and has 
x = 0.6 and y = 0.4, then the groups {AB} and
{ABC} both occur with probability greater than 
1/2, and {CD} occurs with probability less than 
1/2. Thus the MACT for this species tree has the 
topology (((AB)C)D), since this is the only four- 
taxon topology which has exactly the monophyletic 
groups {AB} and {ABC}. Both probabilities are 
only slightly larger than 1/2, however, so in a small 
sample of correctly inferred trees, it is likely that ei- 
ther {AB} or {ABC} would occur less than 50% of 
the time, or that {CD} would occur more than 50% 
of the time. In these cases, the majority-rule con- 
sensus tree would be unresolved or would not match 
the species tree. 

For the greedy consensus algorithm, we would se- 
lect the {AB} clade to be in the tree (because it is the 
most probable other than {ABCD}), and then elim- 
inate all clades except {CD}, {ABC}, and {ABD} 
from consideration since these other clades are in- 
compatible with {AB}. From the three remain- 
ing clades, {ABC} is the most probable — hence the 
GACT has clades {AB} and {ABC}, which means 
that (((AB)C)D) is the GACT. For the R* consensus 
algorithm, the most probable rooted triples for each 
set of three taxa are: (AB)C, (AB)D, (AC)D, and 
(BC)D. Since (((AB)C)D) is the only tree for these 
taxa that is compatible with these rooted triples, R* 
also returns the matching tree. 

Choosing the branch lengths to be (x,y) =
(0.4,0.6) (Table 1, second branch length column)
illustrates that the behavior of MACTs is sensitive 
to the order of the branch lengths. Switching the 
lengths for x and y can change whether the MACT 
is fully resolved. For this tree, most (about 62%) 
gene trees are expected to have an {AB} clade, so 
this clade is very likely to be in the majority-rule con- 
sensus tree for a large enough number of gene trees; 
however, less than 46% of trees are expected to have 
{ABC} in a monophyletic group, so the MACT does 
not have an {ABC} clade. Since no other group is
monophyletic with probability greater than 1/2, this 
MACT is not fully resolved, and is ((AB)CD). Note 
that this lack of resolution is a theoretical limitation
of majority-rule consensus and occurs even though
the species tree and gene trees are fully resolved 
(there are no "hard" polytomies). The lack of reso- 
lution is also not due to insufficient information — in 
other words, the lack of resolution cannot be over- 
come by collecting more data (there are no "soft" 
polytomies). 

When the branch lengths are (x,y) = (0.8, 0.3)
(Table 1, third branch length column), majority-rule
consensus returns the other partially resolved
tree, ((ABC)D). For the branch lengths
(x,y) = (0.3, 0.3), (0.1, 0.1), (0.05, 0.05) (columns
four through six), since no monophyletic subset of
taxa has probability greater than 1/2, the MACTs
for this species tree are star phylogenies. When the
branch lengths are (x,y) = (0.1,0.1) and (x,y) =
(0.05,0.05), ((AB)(CD)) is the most probable gene
tree, although it does not match the species tree.
Gene trees that are more probable than the gene
tree matching the species tree are called anomalous
gene trees (Degnan and Rosenberg, 2006). When
(x,y) = (0.3,0.3), no anomalous gene trees occur,
so this example illustrates that unresolved majority-rule
consensus trees can arise even when there are no
anomalous gene trees. When (x,y) = (0.05,0.05),
the most probable clade is {AB}, which has probability
0.275, so it is included in the greedy consensus
tree. The second most probable clade compatible
with {AB}, however, is {CD}, which has probability
0.212, and thus the greedy consensus tree is
((AB)(CD)), which does not match the species tree.
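
The examples above can be checked numerically with a short Python sketch. The closed-form clade probabilities below were derived for this illustration by conditioning on the coalescent events in each internal branch of (((AB)C)D); they are not copied from Table 1, and the function names are hypothetical. The sketch takes x to be the internal branch ancestral to A, B, and C, and y to be the branch ancestral to A and B only, which is the reading under which the values quoted in the text (0.275 for {AB} and 0.212 for {CD} at (0.05, 0.05)) are reproduced.

```python
from math import exp

def clade_probs(x, y):
    """Clade probabilities on gene trees for species tree (((AB)C)D).

    Assumed convention: x is the branch ancestral to A, B, C (not D) and
    y is the branch ancestral to A and B only. Formulas are derived for
    this sketch under the multispecies coalescent.
    """
    ex, ey, e3x = exp(-x), exp(-y), exp(-3 * x)
    ab = 1 - (2 / 3) * ey - ey * e3x / 9
    ac = ey / 3 - ey * e3x / 9            # {BC} has the same probability
    ad = ey * (ex / 6 + e3x / 18)         # {BD} has the same probability
    cd = ex / 3 - ex * ey / 6 + ey * e3x / 18
    abc = 1 - (2 / 3) * ex - ex * ey / 3 + ey * e3x / 6
    abd = ex / 3 - ex * ey / 6
    acd = ex * ey / 6                     # {BCD} has the same probability
    return {"AB": ab, "AC": ac, "BC": ac, "AD": ad, "BD": ad, "CD": cd,
            "ABC": abc, "ABD": abd, "ACD": acd, "BCD": acd}

def compatible(c1, c2):
    """Clades are compatible if nested or disjoint."""
    a, b = set(c1), set(c2)
    return a <= b or b <= a or not (a & b)

def mact(probs):
    """Clades with probability greater than 1/2."""
    return sorted(c for c, p in probs.items() if p > 0.5)

def gact(probs):
    """Greedy selection by decreasing probability (ties not randomized here)."""
    chosen = []
    for c in sorted(probs, key=probs.get, reverse=True):
        if all(compatible(c, d) for d in chosen):
            chosen.append(c)
    return sorted(chosen)

for x, y in [(0.6, 0.4), (0.4, 0.6), (0.8, 0.3), (0.3, 0.3), (0.05, 0.05)]:
    p = clade_probs(x, y)
    print((x, y), "MACT clades:", mact(p), "GACT clades:", gact(p))
```

Under these assumptions the sketch returns the clades {AB} and {CD} for the greedy tree at (0.05, 0.05) and the clades {AB} and {ABC} at (0.6, 0.4), in line with the discussion above.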

We now describe asymptotic consensus trees for 
more general sets of branch lengths, considering 
three- and four-taxon trees as well as trees with ar- 
bitrary numbers of taxa. 

Majority-rule Consensus 

Three taxa. — For the case of three-taxon trees, the 
MACT is resolved if the probability of the matching 
tree is greater than 1/2. Using the well-known probability
of congruence for a gene tree given a three-taxon
species tree, $1 - (2/3)e^{-T}$ (Nei, 1987), where
T is the length of the one internal branch, this probability
is greater than 1/2 if $T > \log(4/3) \approx 0.28768$.
If the internal branch length is less than this value, 
then increasing the number of independent gene 
trees also increases the probability that the trees do 
not produce a resolved majority-rule consensus tree, 
even though the matching gene tree is more likely 
than any other gene tree. 
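
The three-taxon behavior can be illustrated with an exact binomial calculation. The sketch below is an assumption-laden illustration rather than part of the original analysis: it treats the loci as independent draws with the probabilities given above, and the function names are hypothetical. It computes the probability that no gene tree topology occurs in strictly more than half of the sampled loci, in which case the majority-rule consensus tree is unresolved.

```python
from math import comb, exp, log

def prob_over_half(p, n_loci):
    """P(an outcome with per-locus probability p occurs in > n_loci/2 loci)."""
    return sum(comb(n_loci, k) * p**k * (1 - p)**(n_loci - k)
               for k in range(n_loci // 2 + 1, n_loci + 1))

def prob_unresolved(T, n_loci):
    """P(the three-taxon majority-rule consensus tree is unresolved).

    The matching topology has probability 1 - (2/3)e^{-T}; each of the two
    other topologies has probability (1/3)e^{-T}. At most one topology can
    exceed half of the loci, so the three events are mutually exclusive.
    """
    p_match = 1 - (2 / 3) * exp(-T)
    p_other = exp(-T) / 3
    return 1 - prob_over_half(p_match, n_loci) - 2 * prob_over_half(p_other, n_loci)

threshold = log(4 / 3)
print("threshold:", threshold)
for T in (0.1, 1.0):  # one branch length below the threshold, one above
    for n in (11, 101):
        print(T, n, prob_unresolved(T, n))
```

With T = 0.1, below the threshold, the probability of an unresolved tree grows as loci are added, whereas with T = 1.0 it shrinks.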

Figure 2: Unresolved zones. The shaded regions are different areas of the unresolved zones leading to
different unresolved majority-rule consensus trees. (A) The species tree is (((AB)C)D). A star tree is the
limiting consensus tree for the red region, and the orange region corresponds to the tree with the {ABC}
clade unresolved. For comparison, the anomaly zone is also plotted as the area under the heavy, dark curve.
The anomaly zone cuts across two regions of the unresolved zone, and the area under the line starting
from (x,y) = (0,0.154) which creates the approximately triangular region is the part of the anomaly zone
with three anomalous gene trees. (B) The species tree is ((AB)(CD)). The unresolved zone in this case is
similar in size to that of (A), but there is no anomaly zone for this species tree.



Four taxa. — For four-taxon trees, the branch 
lengths needed for a clade to be in the MACT can be 
obtained by setting the probability of the clade to be 
greater than 1/2 and solving for branch length y in 
terms of branch length x. These clade probabilities 
are functions of gene tree probabilities and are listed 
in Table 1. The model four-taxon trees are shown in
Figure 1.

Details for deriving conditions for clades to be in 
the MACT are given in Appendices 1 and 2. First we 
consider the species tree with topology (((AB)C)D). 
Following Figure 1, let y be the length of the branch
(in coalescent units) ancestral to A and B, but not C,
and let x be the length of the other internal branch
(ancestral to A, B, and C, but not D).
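Then {ABC} is a clade in the MACT if and only if

$$x > \log(4/3) \quad \text{and} \quad y > \log\left[\frac{2e^{2x} - 1}{3e^{3x} - 4e^{2x}}\right], \tag{1}$$

and {AB} is a clade if and only if

$$y > \log\left[\frac{12e^{3x} + 2}{9e^{3x}}\right]. \tag{2}$$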

These two conditions partition the space of branch
lengths into the four possible MACTs for this species
tree (Fig. 2A), where $x = \log(4/3) \approx 0.28768$ is a
vertical asymptote. The MACT is:

(((AB)C)D) if (1) and (2) both hold,

((ABC)D) if (1) holds and (2) fails,

((AB)CD) if (1) fails and (2) holds,

(ABCD) if (1) and (2) both fail.
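
The four cases can be written as a small Python function. This is a sketch under stated assumptions: conditions (1) and (2) are taken to be y > log[(2e^{2x} - 1)/(3e^{3x} - 4e^{2x})] together with x > log(4/3), and y > log[(12e^{3x} + 2)/(9e^{3x})], with x the branch ancestral to A, B, and C and y the branch ancestral to A and B only. The function name is hypothetical.

```python
from math import exp, log

LOG_4_3 = log(4 / 3)

def mact_asymmetric(x, y):
    """MACT topology for species tree (((AB)C)D) from conditions (1) and (2)."""
    # Condition (1): clade {ABC} has probability greater than 1/2.
    # The log argument is only evaluated when x > log(4/3), where its
    # denominator is positive.
    has_abc = x > LOG_4_3 and y > log(
        (2 * exp(2 * x) - 1) / (3 * exp(3 * x) - 4 * exp(2 * x)))
    # Condition (2): clade {AB} has probability greater than 1/2.
    has_ab = y > log((12 * exp(3 * x) + 2) / (9 * exp(3 * x)))
    if has_abc and has_ab:
        return "(((AB)C)D)"
    if has_abc:
        return "((ABC)D)"
    if has_ab:
        return "((AB)CD)"
    return "(ABCD)"

for x, y in [(0.6, 0.4), (0.4, 0.6), (0.8, 0.3), (0.3, 0.3), (0.1, 0.1)]:
    print((x, y), mact_asymmetric(x, y))
```

The branch lengths in the loop are the ones discussed in the Examples section, so the output can be compared directly with the MACTs reported there.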



Similarly, if the species tree is ((AB)(CD)), with
y denoting the length of the branch ancestral to
(AB) and x denoting the length of the other internal
branch, then (AB) is a clade in the MACT if and
only if

$$y > \log\left[\frac{12e^{x} + 2}{9e^{x}}\right] \tag{3}$$

and (CD) is a clade in the MACT if and only if

$$x > \log(4/3) \quad \text{and} \quad y > \log\left[\frac{2}{9e^{x} - 12}\right]. \tag{4}$$



Again, these two conditions partition the branch
length space into four regions, one for each of the
possible MACTs (Fig. 2B), and x = log(4/3) is also
a vertical asymptote for this graph. The MACT is:

((AB)(CD)) if (3) and (4) both hold,

((AB)CD) if (3) holds and (4) fails,

(AB(CD)) if (3) fails and (4) holds,

(ABCD) if (3) and (4) both fail.
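
The symmetric case can be sketched in the same way. The code assumes that conditions (3) and (4) read y > log[(12e^{x} + 2)/(9e^{x})], and x > log(4/3) together with y > log[2/(9e^{x} - 12)], with y the branch ancestral to (AB) and x the branch ancestral to (CD). The function name is hypothetical.

```python
from math import exp, log

LOG_4_3 = log(4 / 3)

def mact_symmetric(x, y):
    """MACT topology for species tree ((AB)(CD)) from conditions (3) and (4)."""
    # Condition (3): clade {AB} has probability greater than 1/2.
    has_ab = y > log((12 * exp(x) + 2) / (9 * exp(x)))
    # Condition (4): clade {CD} has probability greater than 1/2.
    # 9e^x - 12 is positive exactly when x > log(4/3).
    has_cd = x > LOG_4_3 and y > log(2 / (9 * exp(x) - 12))
    if has_ab and has_cd:
        return "((AB)(CD))"
    if has_ab:
        return "((AB)CD)"
    if has_cd:
        return "(AB(CD))"
    return "(ABCD)"

# Illustrative branch lengths, chosen for this example.
for x, y in [(1.0, 1.0), (0.2, 1.0), (1.0, 0.2), (0.394, 0.394)]:
    print((x, y), mact_symmetric(x, y))
```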



Arbitrarily many taxa. — Because equations 
(1)-(4) characterize all possible MACTs for four
taxa, it follows that four-taxon MACTs are never 
misleading in the sense that a four-taxon MACT 
never has a clade that is not a clade in the species 
tree. Due to lack of resolution, however, the MACT 
may fail to have clades that are in the species tree. 
Although we have obtained this result by explicit 
computation for the four-taxon case, the result 
holds for larger trees: 



Theorem 1. (i) The majority-rule asymptotic
consensus tree does not have any clades not on the
species tree. (ii) For all species tree topologies with
$n \geq 3$ taxa, there exist branch lengths for which the
majority-rule asymptotic consensus tree is not fully
resolved.

The proof of the first part of Theorem 1 is pro- 
vided in the section on R* trees below since it is 
a consequence of the consistency of R* consensus 
(Theorem 3). The second part of Theorem 1 follows 
for the three- and four-taxon cases from the calcu- 
lations above. For larger trees, the second part of 
Theorem 1 follows from the inconsistency of greedy 
consensus (Theorem 5) and the fact that greedy con- 
sensus trees are refinements of majority-rule trees. 

The plots in Figure 2 are analogous to the anomaly
zone, the region in branch length space in which the
most likely gene tree does not match the species
tree (Degnan and Rosenberg, 2006, Fig. 2). Note
that the region of parameter space in which MACTs 
are not fully resolved (and therefore do not fully re- 
cover the species tree) is considerably larger than 
the anomaly zone. For example, when we set x = y 
for the four-taxon asymmetric tree, the largest value 
of x that is still in the anomaly zone is approxi- 
mately 0.1568 (Degnan and Rosenberg, 2006); but 
for majority-rule consensus, x = y = 0.345 is ap- 
proximately the largest value for which x = y and 
the MACT is fully unresolved, and x = y = 0.507 
is the largest value for which the MACT is partially 
unresolved, equaling ((AB)CD). For the symmetric 
four-taxon tree, the values x = y = 0.394 result 
in a star consensus tree. This is somewhat surpris- 
ing since these values result in the partially resolved 
tree ((AB)CD) for the asymmetric species tree. For 
the asymmetric four-taxon species tree, the anomaly 
zone is a subset of the zone in which the MACT 
is unresolved. For the symmetric species tree, the
MACT can be unresolved, but there is no anomaly zone.
For four taxa, it is always the case that if a species 
tree has an anomalous gene tree, it does not have a 
fully resolved MACT. 
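
The x = y thresholds quoted above can be recovered approximately by bisection. The clade probabilities used here were derived for this sketch (they are not quoted from the paper's tables or appendices), with x taken as the deeper internal branch of the asymmetric tree and, for the symmetric tree, y as the branch ancestral to (AB). The function names are hypothetical.

```python
from math import exp

def p_ab_asym(x, y):
    """P[{AB} is a clade] for species tree (((AB)C)D)."""
    return 1 - (2 / 3) * exp(-y) - exp(-y - 3 * x) / 9

def p_abc_asym(x, y):
    """P[{ABC} is a clade] for species tree (((AB)C)D)."""
    return 1 - (2 / 3) * exp(-x) - exp(-x - y) / 3 + exp(-3 * x - y) / 6

def p_ab_sym(x, y):
    """P[{AB} is a clade] for species tree ((AB)(CD))."""
    return 1 - (2 / 3) * exp(-y) - exp(-x - y) / 9

def equal_branch_threshold(prob, lo=0.0, hi=5.0, tol=1e-10):
    """Find t with prob(t, t) = 1/2 by bisection.

    Assumes prob(t, t) increases with t and crosses 1/2 between lo and hi,
    which holds for the three functions above.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob(mid, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

star_limit = equal_branch_threshold(p_ab_asym)      # text reports about 0.345
partial_limit = equal_branch_threshold(p_abc_asym)  # text reports about 0.507
sym_limit = equal_branch_threshold(p_ab_sym)        # text reports about 0.394
print(star_limit, partial_limit, sym_limit)
```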

R* Consensus 

Three taxa. — In the case of three taxa, we note 
that the greedy and R* algorithms are equivalent 
when there are infinitely many loci. For both algo- 
rithms, the most frequently occurring clade also de- 
termines a three-taxon statement. In the asymptotic 
case, there is a uniquely occurring most frequent
tree. This tree has probability $1 - (2/3)e^{-T} > 1/3$
(where T is the one internal branch length), and the
other two trees each have probability $(1/3)e^{-T} < 1/3$.
Thus, for the three-taxon case, as the number
of loci approaches infinity, the probability that the 
matching gene tree is the most frequent approaches 
1. 

Arbitrarily many taxa. — We show that R* con- 
sensus trees are consistent estimators of species tree 
topologies. This consistency is based on the fact that 
for any set of three taxa, the rooted triple in the 
species tree is the highest-probability rooted triple 
in the gene tree distribution. 

Lemma 2. Let $\sigma$ be the species tree where S is
the set of taxa on $\sigma$. For any A, B, C $\in$ S, if $\sigma$ has
the grouping (AB)C, then $P_\sigma[(AB)C] > P_\sigma[(AC)B]$.

Proof. Let $\mathcal{J}$ be the set of branches on $\sigma$ on which
A and B can join (i.e., either the lineages A and B or
the lineages containing A and B can coalesce in $\mathcal{J}$),
but on which A and C cannot join. Note that $\mathcal{J}$ is
nonempty and that any branch in $\mathcal{J}$ is an ancestor of
A and B, and not an ancestor of C. Let $\mathcal{K}$ be the set
of branches on which gene lineages A and C can join.
Any branch in $\mathcal{K}$ is an ancestor of A and C. Since
(AB)C is a rooted triple, any ancestor of A and C is
also an ancestor of B. Thus for any branch $k \in \mathcal{K}$, if
none of the lineages A, B, and C have joined, they
are free to do so on k. The probability that A and
B join on a branch in $\mathcal{J}$ is positive. If A and B do
not join in $\mathcal{J}$, then the probabilities that A and B,
A and C, and B and C are the first two of A, B, and
C to join in $\mathcal{K}$ are equal since all pairs of lineages
in a population are equally likely to coalesce. Thus
$P_\sigma[(AB)C] > P_\sigma[(AC)B]$. □

Theorem 3. For a species tree $\sigma$, the R* asymptotic
consensus tree has the same topology as $\sigma$.

Proof. By Lemma 2, any rooted triple in the
species tree has higher probability in the gene tree
distribution than the other two rooted triples for the
same set of three taxa. Thus, the set of rooted triples
from which the R* tree is constructed is exactly the
set of $\binom{n}{3}$ rooted triples in the species tree, where n is
the number of taxa. From Steel (1992), a tree topology
is uniquely specified by its set of rooted triples,
from which it follows that the only tree topology
containing the $\binom{n}{3}$ triples is the topology of the species
tree itself. □

Proof of Theorem 1. (i) This result follows from
Theorem 3 and Theorem 2.14 of Bryant (2003),
according to which every clade in the majority-rule
consensus tree is in the R* tree. Because the MACT
and RACT are the majority-rule and R* consensus
trees applied to coalescent gene tree probabilities,
every clade in the MACT must appear in the RACT.
Because in the limit of infinitely many gene trees, the
R* tree is fully resolved, it follows that if the MACT
has one or more multifurcations, the R* tree is one of
the possible resolutions of the MACT. Because the
R* tree has the same topology as the species tree
(Theorem 3), the MACT either has the species tree
topology or one of its resolutions has the same topology
as the species tree. □

Theorem 3 describes the RACT, which is a mathematical
function of gene tree probabilities, and
therefore of species tree branch lengths. When an 
R* consensus tree is computed from data, however, 
it has some probability of not matching the species 
tree. For an estimator of a parameter to be statis- 
tically consistent, the probability that it gets arbi- 
trarily close to the parameter must approach 1 as 
the sample size approaches infinity. Theorem 4 de- 
scribes the behavior of the R* consensus tree con- 
structed from data when the sample size approaches 
infinity. 

Theorem 4. R* consensus is statistically consis- 
tent. 

The proof of Theorem 4 uses a generalized version
of Bonferroni's inequality, according to which if
there are $k$ events each with probability $p = 1 - q$,
the probability that they all occur is greater than or
equal to $1 - kq$ (Ross, 1998, p. 63).

Proof. It must be shown that for any $\varepsilon > 0$,
there exists $k$ such that if there are at least $k$
independent gene trees, the probability is greater than
$1 - \varepsilon$ that all rooted triples in the species tree are
also the most frequently occurring rooted triples for
each set of three taxa in the collection of gene trees.
Let the species tree be $\sigma$ with taxon set $S$. For
$n$ taxa, there are $\binom{n}{3}$ sets of three taxa in $S$. Let
A, B, and C be three distinct taxa in $S$. Without
loss of generality, assume that (AB)C is the $j$th
rooted triple on $\sigma$. From Lemma 2, $P_\sigma[(AB)C] >
P_\sigma[(AC)B] = P_\sigma[(BC)A]$, where the equality holds
by symmetry. Thus $P_\sigma[(AB)C] = 1/3 + \delta$ and
$P_\sigma[(AC)B] = 1/3 - \delta/2$ for some $\delta > 0$. We use $\hat{P}$ to
denote sample proportions of rooted triples. For any
$\varepsilon > 0$, because sample proportions converge in
probability to their parametric values (by the Weak Law
of Large Numbers) as the sample size tends to $\infty$,
we can choose the number of loci $k_j$ such that with
probability greater than $1 - \varepsilon/\binom{n}{3}$, $\hat{P}_\sigma[(AB)C] > 1/3$,
$\hat{P}_\sigma[(AC)B] < 1/3$, and $\hat{P}_\sigma[(BC)A] < 1/3$. Letting
$k = \max\{k_j : 1 \le j \le \binom{n}{3}\}$, for each set of three taxa
the probability that its most common rooted triple in
the gene tree distribution matches the rooted triple
in the species tree is greater than $1 - \varepsilon/\binom{n}{3}$. The
probability that all of the $\binom{n}{3}$ rooted triples in the
R* tree are rooted triples in the species tree is therefore
greater than $1 - \varepsilon$. □
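The last step of this proof is the Bonferroni bound quoted before it, applied to the events for the individual sets of three taxa. A sketch of the arithmetic follows; the event name $E_j$ is introduced here for illustration only and does not appear in the original argument.

```latex
% E_j (hypothetical label): with k loci, the most frequent sampled rooted triple
% for the j-th set of three taxa is the rooted triple on the species tree.
% Each E_j has probability greater than 1 - \varepsilon / \binom{n}{3}.
\Pr\Bigl(\,\bigcap_{j=1}^{\binom{n}{3}} E_j\Bigr)
  \;\ge\; 1 - \sum_{j=1}^{\binom{n}{3}} \Pr\bigl(E_j^{c}\bigr)
  \;>\; 1 - \binom{n}{3}\cdot\frac{\varepsilon}{\binom{n}{3}}
  \;=\; 1 - \varepsilon .
```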

Greedy Consensus 
Three taxa. — For the case of three taxa, greedy
consensus applied to gene trees is asymptotically
guaranteed to result in the species tree as the number
of gene trees increases. If the species tree has topology
((AB)C) and the one internal branch has length
$T$, a random gene tree has clade (AB) with probability
$1 - (2/3)e^{-T} > 1/3$, whereas (AC) and (BC) each
occur with probability less than 1/3. Thus (AB)
is always the most probable cherry for this topology,
and the GACT always matches the species tree
topology. For finitely many loci, greedy and R* consensus
are not equivalent because they handle ties
differently, with the R* consensus tree sometimes being
unresolved.
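The three-taxon probabilities above are easy to evaluate directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the nonmatching probability $(1/3)e^{-T}$ is inferred from the fact that the three probabilities sum to 1, and the inverse function simply solves $1 - (2/3)e^{-T}$ for $T$, in the spirit of the branch-length estimation idea raised in the Discussion.

```python
from math import exp, log

def triple_probabilities(T):
    """Probabilities of (AB)C, (AC)B, (BC)A for species tree ((AB)C)
    with internal branch length T (coalescent units)."""
    match = 1 - (2 / 3) * exp(-T)
    other = (1 / 3) * exp(-T)  # the two nonmatching topologies are equally likely
    return match, other, other

def branch_length_from_frequency(freq):
    """Invert freq = 1 - (2/3) e^{-T} to recover T (hypothetical helper).
    Only frequencies strictly between 1/3 and 1 give a positive, finite T."""
    if not 1 / 3 < freq < 1:
        raise ValueError("frequency must lie strictly between 1/3 and 1")
    return -log(1.5 * (1 - freq))

if __name__ == "__main__":
    for T in (0.5, log(4 / 3), 0.1):
        print(T, triple_probabilities(T))
```

For any positive `T` the first probability exceeds 1/3 and the other two fall below it, which is why the matching cherry is always the most probable one in the three-taxon case.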

Four taxa. — For the four-taxon symmetric species 
tree and for any choice of branch lengths, the GACT 
has the same topology as the species tree (Ap- 
pendix 2). If the species tree is (((AB)C)D), then 
the GACT can be the symmetric tree ((AB)(CD)). 

To find the set of branch lengths for which the
GACT fails to match the asymmetric species tree
topology, let $x$ and $y$ be the lengths of the deeper
and more recent internal branches, respectively, for
the tree (((AB)C)D) (see Fig. 1A). For this species
tree, the region where the GACT is ((AB)(CD)), the
"too-greedy" zone, consists of those values of $x$ and $y$
for which the clade {CD} is more probable than the
clade {ABC} (see Appendix 2). The values of $x$ and
$y$ for which P({CD}) > P({ABC}) are characterized
by

$$y < \log\left[\frac{3e^{2x} - 2}{18\,(e^{3x} - e^{2x})}\right] \qquad (5)$$

The right-hand side of this inequality is strictly less
than the boundary of the anomaly zone for the tree
(((AB)C)D) (Degnan and Rosenberg, 2006, Equation (4));
thus for this tree, the too-greedy zone is a
subset of the anomaly zone (Fig. 5).
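As a rough numerical check on the too-greedy zone, the bound can be evaluated for specific branch lengths. The sketch below assumes that equation (5) reads $y < \log[(3e^{2x} - 2)/(18(e^{3x} - e^{2x}))]$, a reading that agrees numerically with the curve endpoints quoted in the Figure 5 caption; the function names are hypothetical.

```python
from math import exp, log

def too_greedy_bound(x):
    """Assumed right-hand side of eq. 5: log[(3e^{2x} - 2) / (18 (e^{3x} - e^{2x}))].
    Defined for x > 0."""
    return log((3 * exp(2 * x) - 2) / (18 * (exp(3 * x) - exp(2 * x))))

def in_too_greedy_zone(x, y):
    """True if branch lengths (x, y) on (((AB)C)D) satisfy the assumed eq. 5."""
    return y < too_greedy_bound(x)

if __name__ == "__main__":
    for x, y in [(0.05, 0.05), (0.1, 0.1), (0.3, 0.3)]:
        print((x, y), in_too_greedy_zone(x, y))
```

Under this reading, the point (0.05, 0.05) used in the finite-sample examples lies inside the zone, while (0.1, 0.1) lies outside it.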

More than four taxa. — The result that greedy consensus
can be misleading in the four-taxon case generalizes
to any species topology with more than four
taxa. Intuitively, by making some branches long and
some short (so that coalescent events occur with
probability arbitrarily close to 0 or 1), trees with
five or more taxa can be made to behave similarly
to the four-taxon asymmetric case. The strategy of
the proof is therefore similar to that of Lemma 5 in
Degnan and Rosenberg (2006).

Theorem 5. For three-taxon species topologies,
and for four-taxon symmetric species topologies, the
GACT matches the species tree; for the asymmetric
topology with $n = 4$ taxa and for every species
topology with $n \ge 5$ taxa, greedy consensus is
inconsistent.



Lemma 6. The four-taxon asymmetric topology 
(((AB)C)D) has a set of branch lengths which makes 
greedy consensus fail to match the species tree. 

Proof. This set is explicitly derived in Appendix 2
and is given in equation (5) and Figure 5. □

Lemma 7. For every bifurcating species tree with
$n \ge 5$ taxa and every $k \ge 1$ with $2^{k+1} < n$, there is
a node with $c$ terminal descendants, where $2^k < c <
2^{k+1} + 1$.

Proof. For all $k$ satisfying $2^{k+1} + 1 \le n$, the root
has $n \ge 2^{k+1} + 1$ terminal descendants. Let $M_0$
denote the root node, and let $M_1$ denote the internal
node immediately descended from the root with
the larger number of terminal descendants (choosing
arbitrarily in case of a tie). Similarly let $M_2$
be the internal node (if it exists) immediately
descended from $M_1$ with the larger number of terminal
descendants. Continue this process until a node $M_m$
($m \ge 0$) is reached which has at least $2^{k+1} + 1$
terminal descendants, but neither of whose immediate
descendant nodes has more than $2^{k+1}$ terminal
descendants. Call $M_m$ the "minimal node". It follows
that at least one of the immediate descendant nodes
of the minimal node has more than $2^k$ terminal
descendants (since otherwise the minimal node would
have at most $2(2^k) < 2^{k+1} + 1$ descendants). Thus at
least one immediate descendant of the minimal node
has $c$ terminal descendants with $2^k < c < 2^{k+1} + 1$.
□
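The descent used in this proof can be traced on a concrete tree. The following sketch is an illustration under stated assumptions rather than part of the original argument: trees are represented as nested tuples with string leaves, and `lemma7_node` is a hypothetical name for the walk from the root toward the larger child.

```python
def n_leaves(tree):
    """Number of terminal descendants of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return 1
    return sum(n_leaves(child) for child in tree)

def lemma7_node(tree, k):
    """Follow the larger child from the root until the 'minimal node' is reached,
    then return its larger child, which has c leaves with 2**k < c <= 2**(k + 1)."""
    bound = 2 ** (k + 1)
    if n_leaves(tree) <= bound:
        raise ValueError("the tree needs more than 2**(k + 1) taxa")
    node = tree
    while True:
        larger = max(node, key=n_leaves)  # ties are broken arbitrarily (first child)
        if n_leaves(larger) <= bound:
            return larger  # `node` is the minimal node of the proof
        node = larger

if __name__ == "__main__":
    caterpillar = (((("A", "B"), "C"), "D"), "E")
    print(lemma7_node(caterpillar, 1))
```

Because the minimal node has at least `2**(k + 1) + 1` leaves split between two children, its larger child always has more than `2**k` leaves, which is the bound the lemma needs.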

Figure 3: Reduction of topologies used in the proof of Lemma 9. If two trees are connected by an edge,
then the topology with the smaller number of leaves is a left subtree of the larger tree.

Figure 4: Reduction of the remaining trees from Figure 3 to the four-taxon asymmetric case. Branches in
orange are made long enough that all lineages on these branches coalesce with probability arbitrarily close
to 1.

Lemma 8. If for some $k \ge 2$, all species tree
topologies with $n$ taxa, $n \in \{2^k + 1, \ldots, 2^{k+1}\}$, have
a nonempty too-greedy zone, then all species tree
topologies with $n \ge 2^k + 1$ (and thus $n \ge 2^2 + 1$)
taxa have a nonempty too-greedy zone.

Proof. Assume there exists $k \ge 2$ such that all
species tree topologies with $n \in \{2^k + 1, \ldots, 2^{k+1}\}$
taxa have a nonempty too-greedy zone, i.e., that
there exist branch lengths for which the GACT does
not match the species tree topology. By Lemma 7,
any species tree $\sigma$ with more than $2^{k+1}$ ($k \ge 1$) taxa
$S$ has some node $M$ with $c$ terminal descendants,
where $c \in \{2^k + 1, \ldots, 2^{k+1}\}$. Let $\sigma_M$ denote the
species tree rooted at $M$ and let $S_M$ denote the taxa
labeling the tips of $\sigma_M$. By assumption, $\sigma_M$ has a
nonempty too-greedy zone.

Make the lengths of all branches outside of $\sigma_M$
long enough that the probability that all lineages on
these long branches coalesce is greater than $1 - \varepsilon$,
where $\varepsilon$ is chosen so that $1 - \varepsilon > 1/2$ and $1 - \varepsilon$
is greater than the probability of any clade within
$\sigma_M$ (i.e., any clade which is a proper subset of $S_M$).
Because the greedy consensus tree is a refinement
of the majority-rule consensus tree, all clades which
include taxa outside of $S_M$, and the clade consisting
of all taxa in $S_M$, are included in the GACT. When
ranking clade probabilities as is required for the
algorithm for constructing the GACT, these clades are
added before the clades consisting of taxa which are
proper subsets of $S_M$. Thus eventually the list of
candidate clades consists only of proper subsets of
$S_M$. When clades are accepted from this list, by
assumption we accept at least one clade to be in the
GACT which is not on $\sigma$. Thus there exist branch
lengths on $\sigma$ for which the GACT does not match
the species tree. □



Lemma 9. For any species tree topology with 5,
6, 7, or 8 taxa, there exists a set of branch lengths
for which the greedy asymptotic consensus tree does
not match the species tree.

Proof. This is shown by reduction to the four-taxon
asymmetric case. For each species tree topology
with 5, 6, 7, or 8 taxa, some branches can be
made long, and some short, so as to produce the
same inconsistencies as in the four-taxon case. Most
cases are shown in Figure 3. Here a topology with
$n$ taxa is connected by an edge to a topology with
fewer than $n$ taxa if the smaller topology is the left
subtree (the subtree descended from the node that is
the immediate left descendant of the root) of the
larger topology. In this case, for any $\varepsilon > 0$, any
branches on the larger topology not in the left subtree
can be made arbitrarily long. Thus any lineages
available to coalesce on long branches do coalesce with
probability greater than $1 - \varepsilon$. Remaining clades then
have the same order of probabilities as on the left
subtree, and thus are accepted by the greedy algorithm
in the same order as on the left subtree.

If the greedy consensus algorithm returns a non- 
matching tree for the smaller tree, it also does so 
for the larger tree since the ranking of the remaining 
clades by frequencies is eventually the same (once 
the high probability clades have already been added 
on the larger tree). This process of reducing trees 
can be repeated until one of the trees colored orange 
(which have no edges connecting to a smaller tree) 
is reached. 

It then remains to be shown that the GACT does not
match the species tree for the remaining orange trees
from Figure 3. This is already shown explicitly for
the four-taxon case (Lemma 6). The other trees
can again be reduced to the four-taxon case by
choosing certain edges to be long and others short.
This is shown in Figure 4. By choosing the long,
orange branches to have large branch lengths, the
probability that all available lineages coalesce on a
branch can be made greater than $1 - \varepsilon/(2m)$, where
$m$ is the number of long branches on a tree. This
makes the probability that all available lineages on
long branches coalesce greater than $1 - \varepsilon/2$. Since
only counterexamples are needed to show that the
greedy consensus algorithm can return a nonmatching
tree, it is sufficient to note that branches can be
chosen to be short enough using eq. 5 or Figure 5 for
the four-taxon asymmetric tree to make the greedy
consensus algorithm fail to return the tree matching
the species tree with probability greater than $1 - \varepsilon/2$.
Making the black internal branches sufficiently short,
the probability is greater than $1 - \varepsilon$ that the entire
tree returned by the greedy consensus algorithm
fails to match the species tree topology. □

Proof of Theorem 5. The result for three taxa 
follows from the fact that the matching gene tree 
has the highest probability of the three possible gene 
trees. The four-taxon asymmetric case is covered in 
Lemma 6. The four-taxon symmetric case is shown 
to be consistent in Appendix 2 by showing that for 
all branch lengths, (AB) and (CD) are the two most 
probable clades. We have shown that all cases with
n = 5, 6, 7, or 8 taxa have too-greedy zones (Lemma
9). From Lemma 8, this verifies by induction that
all cases with $n \ge 5$ taxa have such zones. □

Proof of Theorem 1(ii). The GACT and MACT
are examples of greedy and majority-rule consensus
trees, respectively. It follows that if the
MACT is fully resolved, then it is the same as
the GACT since greedy consensus trees are
refinements of majority-rule consensus trees (Bryant,
2003). However, by Theorem 5, for any species
tree topology with $n \ge 5$ taxa, there exist branch
lengths for which the GACT has a clade not on the
species tree, and therefore cannot be equivalent to
the MACT (by Theorem 1(i)). Therefore a sufficient
condition for the MACT to be unresolved is
for the GACT to not match the species tree. Since
exact conditions for the MACT to not be fully
resolved were obtained earlier for smaller trees (the
internal branch length being no greater than log(4/3)
for three-taxon trees and one of eqs. (1)-(4) to fail for
four-taxon trees), the result follows for any species
tree with $n \ge 3$ taxa. □



Figure 5: The too-greedy zone. The upper curve is
the boundary of the anomaly zone for the species
tree (((AB)C)D). For points below this curve, there
is either one or three anomalous gene trees (AGTs).
The two blue regions to the left of the curve
which extends from roughly $(x, y) = (0.067, 0.0)$ to
$(0.0078, 2.0)$ constitute the too-greedy zone, where
the GACT is ((AB)(CD)).

Finite Numbers of Loci

Theory

The asymptotic consensus trees occur in the limit
as the number of loci approaches infinity. What happens
with a finite number of loci? In this case, we
can examine the behavior of consensus trees from a
theoretical point of view by considering all possible
finite samples of gene trees. The probability of a
particular consensus tree is the sum of the probabilities
of those samples of gene trees that result in that
consensus tree. These probabilities can be determined
by noting that a sample of independent loci has a
multinomial distribution, where the categories are
the gene tree topologies, and the probabilities are
given by the theory of the multispecies coalescent
(Degnan and Salter, 2005).

To compute the probability of a consensus tree
given a finite sample of $\ell$ gene trees, let $\ell_i$,
$i = 1, \ldots, k$, be the number of times gene tree $i$ is
observed, where $\sum_i \ell_i = \ell$ and there are $k$ possible
gene tree topologies, and let $c(\ell_1, \ldots, \ell_k)$ denote the
consensus tree resulting from a particular sample. The
probability that a sample results in the consensus tree
having topology $T$ is therefore

$$\sum_{\ell_1, \ldots, \ell_k \ge 0} \frac{\ell!}{\ell_1! \cdots \ell_k!}\; p_1^{\ell_1} \cdots p_k^{\ell_k}\; I\bigl(c(\ell_1, \ldots, \ell_k) = T\bigr) \qquad (6)$$

where $I$ is an indicator that the consensus tree has
topology $T$, $p_i$ is the gene tree probability for the $i$th
topology, and the sum is over all nonnegative integer
solutions to $\ell_1 + \cdots + \ell_k = \ell$. There are $\binom{\ell + k - 1}{k - 1}$ terms
in the sum (Ross, 1998, p. 13). For four taxa and 25
loci, the sum has approximately $1.51 \times 10^{10}$ terms.

To compute the probabilities of finite-sample
greedy consensus trees, probabilities of resolutions
of ties must also be taken into account. This can
be done by summing over all possible tie-breaks
and treating each possible tie-break as equally likely,
rather than randomly breaking ties. The probability
of the greedy consensus tree having topology $T$ can
therefore be written as

$$\sum_{\ell_1, \ldots, \ell_k \ge 0} \frac{\ell!}{\ell_1! \cdots \ell_k!}\; p_1^{\ell_1} \cdots p_k^{\ell_k} \sum_{b_1 \in B_1} \cdots \sum_{b_r \in B_r(b_1, \ldots, b_{r-1})} \prod_{j=1}^{r} \Pr(b_j)\; I\bigl(c(\ell_1, \ldots, \ell_k, b_1, \ldots, b_r) = T\bigr) \qquad (7)$$

where $B_j$ denotes the set of possible tie-breaks in the
$j$th round, $b_j$ denotes one way (out of $|B_j|$ possible
ways, where $|B_j|$ is the number of elements in $B_j$)
of breaking up a set of tied clade frequencies in the
$j$th round (out of $r$ rounds) of choosing clades for
the greedy consensus tree, and $\Pr(b_j) = 1/|B_j|$ is
the probability of a particular tie break. In general,
the set $B_j$ is a function of the choices $b_1, \ldots, b_{j-1}$ in
preceding rounds of tie-breaks, since the possible tie
breaks in a given round may depend on how previous
ties were resolved. For $n$-taxon trees, there are
$n - 2$ rounds of tie breaks, assuming the case when
no tie breaks are necessary (i.e., there is one clade on
the list which is most frequent) is treated as a trivial
tie break with $|B_j| = 1$. For example, for four-taxon
trees, there are two rounds of tie breaks. The function
$c$ in eq. 7 has been given additional arguments
(compared with eq. 6) so that the consensus tree is
a function of both the gene tree frequencies and the
tie-breaks.
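The multinomial sum of eq. 6 can be evaluated by brute force when the number of topologies and loci is small. The sketch below is only an illustration, not the authors' implementation: it enumerates all count vectors for three-taxon trees and applies a simple majority-rule consensus function. The helper names (`compositions`, `multinomial_pmf`, `consensus_probability`, `majority_rule_three_taxa`) are hypothetical.

```python
from math import comb, exp, factorial, prod

def compositions(total, parts):
    """Yield every tuple of `parts` nonnegative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def multinomial_pmf(counts, probs):
    """Multinomial probability of observing `counts` given topology probabilities."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # each division is exact
    return coef * prod(p ** c for p, c in zip(probs, counts))

def consensus_probability(n_loci, probs, consensus, target):
    """Sum of eq. 6: total probability of samples whose consensus equals `target`."""
    return sum(multinomial_pmf(c, probs)
               for c in compositions(n_loci, len(probs))
               if consensus(c) == target)

def majority_rule_three_taxa(counts):
    """Index of the topology seen in more than half of the loci, else 'star'."""
    for i, c in enumerate(counts):
        if 2 * c > sum(counts):
            return i
    return "star"

# Number of terms for four taxa (k = 15 topologies) and 25 loci.
n_terms = comb(25 + 15 - 1, 15 - 1)

if __name__ == "__main__":
    p = 1 - (2 / 3) * exp(-0.5)            # matching probability for T = 0.5
    probs = (p, (1 - p) / 2, (1 - p) / 2)
    print(n_terms)
    print(consensus_probability(5, probs, majority_rule_three_taxa, 0))
```

Full enumeration is practical only for small cases; the term count for four taxa and 25 loci shows why a direct sum quickly becomes expensive.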

Because there are a finite number of trees and consensus
trees are computed for every sample, many
samples include gene trees which imply incompatible
sets of rooted triples due to there being ties
in the most frequently occurring rooted triple for
a given set of taxa. In these cases, the R* algorithm
returns a tree which is partially or completely
unresolved. For example, if there are four input
gene trees: (((AB)C)D), (((AD)C)B), (((BC)A)D),
and (((CD)A)B), then the rooted triples (AD)B and
(AB)D each occur twice; thus the R* consensus
tree is unresolved with respect to the relationships
between A, B, and D. Similarly the rooted triples
(BC)D and (CD)B each occur twice. However, the
rooted triple (AC)B occurs twice whereas (AB)C
and (BC)A each occur once, so the R* consensus
tree has the rooted triple (AC)B. Similarly, (AC)D
occurs in the R* tree. Thus the R* consensus tree
for this set of gene trees is the partially unresolved
tree ((AC)BD). The majority-rule tree for this set of
input trees is completely unresolved, and the greedy
consensus tree returns each of the four input trees
with probability 0.25.
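The rooted-triple counts in this example can be reproduced with a few lines of code. The sketch below assumes a nested-tuple representation of rooted trees and keeps a rooted triple only when it is strictly more frequent than both alternatives; it stops at the selected triples and does not assemble the R* tree itself. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of clade leaf sets) for a nested-tuple tree with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def rooted_triple(tree, taxa):
    """Return (pair, outgroup) for the rooted triple the tree displays on three taxa, or None."""
    _, tree_clades = clades(tree)
    for pair in combinations(sorted(taxa), 2):
        (outgroup,) = set(taxa) - set(pair)
        if any(set(pair) <= c and outgroup not in c for c in tree_clades):
            return pair, outgroup
    return None

def r_star_triples(trees):
    """Map each set of three taxa to its uniquely most frequent rooted triple, or None on a tie."""
    taxa, _ = clades(trees[0])
    chosen = {}
    for three in combinations(sorted(taxa), 3):
        counts = Counter(rooted_triple(t, three) for t in trees)
        counts.pop(None, None)  # ignore trees that leave the three taxa unresolved
        ranked = counts.most_common()
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            chosen[three] = ranked[0][0]
        else:
            chosen[three] = None
    return chosen

gene_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "D"), "C"), "B"),
    ((("B", "C"), "A"), "D"),
    ((("C", "D"), "A"), "B"),
]

if __name__ == "__main__":
    for three, triple in r_star_triples(gene_trees).items():
        print(three, triple)
```

For the four input trees, only the sets {A, B, C} and {A, C, D} yield a uniquely most frequent triple, which corresponds to the partially unresolved tree ((AC)BD) described in the text.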

Examples 

Three taxa. — We illustrate the case of finite loci
using three (Fig. 6) and four taxa (Figs. 7 and
8). With three taxa, there is only one internal
branch length, and this determines all gene tree
probabilities, with the probability that the gene tree
matches the species tree being $1 - (2/3)e^{-T}$, where
$T$ is the length of the internal branch. We used
((AB)C) as the species tree with branch lengths
of 0.5, $\log(4/3) \approx 0.288$, and 0.1, corresponding to
matching probabilities of 0.596, 0.5, and 0.397,
respectively.

For the branch length of 0.5, the majority of
loci (almost 60%) are likely to have the matching
topology; thus, given enough loci, all three methods
(majority-rule, R*, and greedy) are expected to
have a high probability of returning the matching
tree. This does in fact occur, with the greedy
consensus tree having the highest probability for
any given number of loci. The R* method has the
second-best performance, although by 50 loci, the
greedy and R* algorithms have roughly equivalent
performance. When the branch length was chosen
such that the probability of matching was 0.5
(Fig. 6B, with the two nonmatching trees each
having probability 0.25), majority-rule was stuck
between returning the correct tree and the star
tree. This was not surprising since ((AB)C) by
design does not occur more than 50% of the time.
The pattern for this case, as well as for the branch
length of 0.1 (Fig. 6C), continues for greedy and R*
consensus, with greedy having the best performance,
and R* slowly approaching greedy as the number of
loci increases (and therefore the probability of ties
decreases). Also, for the branch length of 0.1, no
tree has greater than 50% probability of occurring,
and therefore majority-rule becomes increasingly
likely to return a star tree as the number of loci
increases.

Figure 6: Species tree ((AB)C). Probabilities of
consensus trees from finite numbers of known gene
trees. Each plot shows the probability that each of
the three consensus methods will return either the
species topology, ((AB)C), or a star tree (R* and
majority-rule only). The legend in (A) applies to
each of the three plots.

Four taxa. — Figure 7 shows the behavior of the
three consensus methods as the number of loci
increases when the species tree is (((AB)C)D), and
Figure 8 shows the same consensus methods when
the species tree is ((AB)(CD)). The results of the
two figures are similar, although the methods generally
perform better with the symmetric species tree.

Figure 7A suggests that large numbers of loci
might be needed before one majority-rule consensus
tree becomes the most probable. Figures 7B,C and
8B,C show that majority-rule consensus can fairly
quickly converge to a star phylogeny even though the
probability of a star phylogeny decreases under R*
and greedy consensus.

For majority-rule trees, there is also an effect of
having an odd or even sample size, where even sample
sizes tend to give higher probabilities to unresolved
trees. This occurs because even sample sizes
increase the opportunity for ties in the number of
times two (or more) clades are observed, and in these
cases neither clade can be in the majority. This has
the somewhat surprising result that a consensus tree
can be less likely to match the species tree in a sample
of $2n$ loci than in a sample of $2n - 1$ loci (although
in being more likely to have an unresolved
tree, it is also less likely to produce a tree resolved
in a way that conflicts with the species tree). For
the symmetric species topology with branch lengths
of $x = 0.6$ and $y = 0.4$, note that the majority-rule
consensus tree is more likely to be the species
tree topology ((AB)(CD)) than any other topology
if the sample size is odd, but for even sample
sizes up to 25 loci, the unresolved tree ((AB)CD) is
roughly tied in probability with ((AB)(CD)). This
is consistent with Figure 2B, in which the point
$(x, y) = (0.6, 0.4)$ is close to the boundary between the
regions for ((AB)(CD)) and (AB(CD)). However, if
the number of loci is sufficiently large, majority-rule
consensus is expected to return the resolved tree
((AB)(CD)) that matches the species topology, since
the point $(x, y) = (0.6, 0.4)$ is slightly outside the
zone where the MACT is unresolved. This can be
verified from equations 3 and 4.
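The odd/even effect is easiest to see in the simpler three-taxon setting, where majority-rule returns the species tree exactly when the matching topology appears in strictly more than half of the loci. The sketch below uses that simplification, which is an assumption made for illustration and not the four-taxon calculation behind Figures 7 and 8; the function name is hypothetical.

```python
from math import comb, exp

def prob_majority_matches(n_loci, p_match):
    """P(the matching topology occurs in strictly more than half of n_loci
    independent loci), i.e. an upper binomial tail."""
    need = n_loci // 2 + 1
    return sum(comb(n_loci, j) * p_match ** j * (1 - p_match) ** (n_loci - j)
               for j in range(need, n_loci + 1))

if __name__ == "__main__":
    p = 1 - (2 / 3) * exp(-0.5)  # three-taxon matching probability for T = 0.5
    for n in (9, 10, 11, 12):
        print(n, round(prob_majority_matches(n, p), 4))
```

In this setting an even sample of $2n$ loci is always less likely to give a strict majority than the preceding odd sample of $2n - 1$ loci, because the extra locus can only help when the count is already past a tie.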

As the number of loci increases, the finite-sample
R* trees (Figs. 7D-F and Figs. 8D-F) show increasing
probability of matching the species tree topology,
regardless of how short the branches are, including
for branch lengths that are in the anomaly
zone, $(x, y) = (0.1, 0.1)$, and the too-greedy zone,
$(x, y) = (0.05, 0.05)$. This agrees with our theoretical
expectations of R* consensus trees (Theorem 4);
however, the increase in probability is very slow. For
example, when $(x, y) = (0.1, 0.1)$ and the species tree
is asymmetric (Fig. 7E), the two trees most likely to
be returned are (ABCD) and ((AB)CD) until there
are 23 loci, at which point the matching topology
(((AB)C)D) changes from being the third to the second
most probable topology. The star tree (ABCD)
has the highest probability for 11 and fewer loci, and
as a trend is decreasing in probability as the sample
size increases. The tree ((AB)CD), however, is
still increasing in probability at 25 loci; thus large
numbers of loci might be needed for R* to show a
clear preference for the matching tree. The probability
that R* returns the species tree topology grows
more slowly when $(x, y) = (0.05, 0.05)$ (Figs. 7F, 8F);
however, it is the only one of the three methods for
which the probability is increasing with those branch
lengths.

Greedy consensus trees show more smoothly increasing
probabilities of returning the matching tree
for branch lengths outside of the too-greedy zone
(Figs. 7G,H and Figs. 8G-I). When the species tree
is (((AB)C)D) and $(x, y) = (0.1, 0.1)$ (Fig. 7H), the
gene tree ((AB)(CD)) is more probable than the
matching tree, and here greedy consensus is slightly
more likely to return this tree for small samples; but
the matching tree becomes the most probable greedy
consensus tree with 11 or more loci. However, for
this species tree, the more extreme branch lengths
of $(x, y) = (0.05, 0.05)$ make increasing the number
of loci more likely to result in greedy consensus
returning the nonmatching tree ((AB)(CD)) (Fig. 7I).
These results are consistent with our expectations
based on the too-greedy zone (Fig. 5).

Discussion 

Using coalescent probabilities to determine 
asymptotic consensus trees enables the prediction of 
what occurs when consensus trees are constructed 
from gene trees from many independent loci. We 
have obtained results for the three types of asymp- 
totic consensus trees considered: majority-rule, R* , 
and greedy (Theorems 1, 3, and 5, respectively), 
which describe the fact that with an infinite num- 
ber of loci, MACTs might be unresolved, GACTs 
might be nonmatching, and RACTs always match 
the species tree. These results have implications for 
a common goal of phylogenetics: the inference of
species trees.

Estimating Species Trees

Although concatenation of sequences is perhaps
the most widely used method of estimating
species trees, there are several current alternatives
to concatenation for inferring species
trees. These include minimizing deep coalescence
(Maddison and Knowles, 2006), finding the joint
posterior of the species tree and gene trees from
the coalescent model in a Bayesian framework
(Liu and Pearl, 2007), using the most ancient speciation
times compatible with the set of inferred
coalescent times on a set of gene trees (called
the "maximum tree" by Liu and Pearl (2007) or
"GLASS tree" by Mossel and Roch (2007)), and using
probabilities of gene tree topologies to approximate
the species likelihood (Carstens and Knowles,
2007; Carling and Brumfield, 2008). These methods
are designed to estimate species trees when there
is gene tree conflict due to incomplete lineage sorting,
and they do not assume that sequence data are
generated under a single gene tree topology.

Theorem 4 suggests a statistically consistent
method for building species tree topologies from gene
tree topologies (assuming known gene trees). This
involves inferring all rooted triples and then applying
a method such as that of Bryant and Berry (2001)
to build up the tree from the $\binom{n}{3}$ rooted triples.

Although this method does not estimate branch
lengths on the species tree, rooted triples could also
be used to estimate internal branch lengths on the
species tree by using $P_\sigma[(AB)C] = 1 - (2/3)e^{-T}$,
where $T$ is the length separating the MRCA of A,
B, and C from the MRCA of A and B. Thus, the
frequency of each rooted triple in the observed set of
gene trees could be used to estimate species divergence
times, from which the species tree (including
topology) could be constructed; or, given a species
tree topology, the set of branch lengths most compatible
with the observed rooted triples could be determined
using a criterion such as maximum likelihood
or least squares.

Using majority-rule trees to estimate species trees
from finitely many loci is expected to not be misleading,
but is likely to result in a tree that is at
least partially unresolved. It is thus expected to be
a conservative estimate of the species tree, with little
power to resolve some clades for some sets of branch
lengths.

Mutation and recombination

In this paper, we have not considered the roles of 
mutation and recombination and the resulting un- 
certainty that occurs when gene trees are inferred 
from sequence data. When gene trees are estimated 
and the underlying species tree has short branches, 
some gene trees are expected to not be fully resolved 
due to insufficient sequence divergence. Due to the 
inherent stochasticity in sequence evolution, there 
will also be some incorrectly inferred gene trees. For 
finite numbers of genes, these factors would tend to 
increase the probability that majority-rule consensus 
trees would have some lack of resolution, whether 
or not the true MACT was fully resolved. If the 
MACT is a star tree, we speculate that mutation
would cause convergence to a star tree to occur more
quickly as the number of loci is increased. If the
MACT does have some resolved clades, then uncer- 
tainty in the gene trees would be expected to increase 
the number of loci needed to have a high probabil- 
ity that the majority-rule tree is correctly resolved. 
We expect similar effects for R* and greedy consen- 
sus trees, but ultimately, the effects of mutation on 
constructing consensus trees could be assessed by 
simulating sequence data for independent gene trees 
evolving in the same species tree. 

When recombination occurs within genes, different
topologies may exist for different segments
within a gene, further complicating the distribution
of site patterns (Wiuf et al., 2001).



Conclusions 

Our results show that when there is sufficient gene 
tree discordance due to incomplete lineage sorting, 
majority-rule consensus trees can have a high proba- 
bility of being at least partially unresolved, and the 
probability of being unresolved can approach 1 as 
the number of genes increases indefinitely. However, 
the MACT is never resolved incorrectly; that is, it 
never has a clade not supported on the species tree. 
We therefore describe the MACT as not misleading; 
however, it is not consistent, because statistical con- 
sistency implies that an estimator gets arbitrarily 
close to a parameter (e.g., a fully resolved species 
tree) with probability approaching 1 as the sample 
size increases. 

The fact that under the multispecies coalescent,
R* trees are asymptotically guaranteed to be fully
resolved and to match the species tree topology means
that the R* procedure is not only not misleading,
but is also a statistically consistent estimator of the
species tree topology. This is remarkable considering
that R* trees (which are defined for any collection
of trees) are based only minimally on a model of
species tree-gene tree relationships. The only feature
of the multispecies coalescent model used in proving
the consistency of the R* method is the fact that
in this model, three-taxon relationships that occur
in the species tree are also expected to occur in the
gene tree distributions. Thus, although R* consensus
trees are consistent without explicitly incorporating
gene tree probabilities into their algorithm for
constructing trees, the R* consensus tree is not
necessarily robust to violations of assumptions in the
coalescent, such as the absence of population structure
along ancient internal edges.

Finally, greedy consensus trees can be increasingly 
likely (as the number of gene trees increases) to have 
a topology that differs from that of the species tree. 
Thus greedy consensus trees can be positively mis- 
leading if used as estimators of species trees. How- 
ever, for four taxa, the region of parameter space in 
which greedy consensus fails to return the true tree — 
the too-greedy zone — is smaller than the anomaly 
zone; hence greedy consensus offers some robust- 
ness to gene tree discordance that may cause other 
methods to fail to recover the species tree. In ad- 
dition, the greedy consensus method outperformed 
our other methods for branch lengths outside of the 
too-greedy zone. To test these consensus methods in 
practice will require examining their performance in 
the presence of mutation (both from real and sim- 
ulated sequence data) that can cause gene trees to 
be estimated with uncertainty rather than treated 
as known. Although in our results, R* consensus 
outperformed majority-rule consensus, for R* and 
greedy consensus there may be a tradeoff between 
consistency and speed of convergence, with greedy 
consensus being quicker to converge yet statisti- 
cally inconsistent, and with R* consensus being slower 
to converge yet statistically consistent. 
Table 1: Probabilities of four-taxon gene tree topologies, clades, and rooted triples for the species tree 
(((AB)C)D) with different branch lengths. A clade (rooted triple) probability is the sum of probabilities 
of gene tree topologies which have the clade (rooted triple). Branch lengths are as in the model species 
tree in Figure 1A. An asterisk indicates that a clade has probability greater than 1/2, and would therefore 
be represented in the MACT. 



                                    Branch lengths (x, y)
Entry    Probability                (.6, .4)  (.4, .6)  (.8, .3)  (.3, .3)  (.1, .1)  (.05, .05)

Gene tree
  ... .075
  ... .052

Clade
{AB}     p1 + p2 + p13              ...       ...       ...       ...       ...       .275
{AC}     p3 + p4 + p14              .211      .165      .239      .213      .227      .226
{AD}     p5 + p6 + p15              .067      .071      .059      .108      .174      .196
{BC}     p7 + p8 + p15              .110      .104      .103      .149      .227      .226
{BD}     p9 + p10 + p14             .067      .071      .059      .108      .174      .196
{CD}     p11 + p12 + p13            .128      .171      .098      .172      .202      .212
{ABC}    p1 + p3 + p7               .530*     .458      .601*     .373      .236      .201
{ABD}    p2 + p5 + p9               .121      .162      .094      .155      .165      .166
{ACD}    p4 + p6 + p11              .061      .061      .055      .091      .136      .151
{BCD}    p8 + p10 + p12             .061      .061      .055      .091      .136      .151

Rooted triple
(AB)C    p1 + p2 + p5 + p9 + p13    .553      .634      .506      .506      .397      .366
(AC)B    p3 + p4 + p6 + p11 + p14   .223      .183      .247      .247      .302      .317
(BC)A    p7 + p8 + p10 + p12 + p15  .223      .183      .247      .247      .302      .317
(AB)D    p1 + p2 + p3 + p7 + p13    .755      .755      .778      .634      .454      .397
(AD)B    p4 + p5 + p6 + p11 + p15   .123      .123      .111      .183      .273      .302
(BD)A    p8 + p9 + p10 + p12 + p14  .123      .123      .111      .183      .273      .302
(AC)D    p1 + p3 + p4 + p7 + p14    .634      .553      .700      .506      .397      .366
(AD)C    p2 + p5 + p6 + p9 + p15    .183      .223      .150      .247      .302      .317
(CD)A    p8 + p10 + p11 + p12 + p13 .183      .150      .247      .223      .302      .317
(BC)D    p1 + p3 + p7 + p8 + p15    .634      .553      .700      .506      .397      .366
(BD)C    p2 + p5 + p9 + p10 + p14   .183      .223      .150      .247      .302      .317
(CD)B    p4 + p6 + p11 + p12 + p13  .183      .223      .150      .247      .302      .317
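
The rooted-triple rows can be reproduced from a standard coalescent result: if the two lineages of a species-tree cherry share an internal path of length T before the third lineage can join them, the matching rooted triple has probability 1 - (2/3)e^{-T}, and the other two triples split the remainder equally. The sketch below assumes, consistent with eqs. 9 and 10, that y is the branch ancestral to A and B only and x is the branch ancestral to A, B, and C only. The function names are illustrative.

```python
from math import exp


def triple_probs(t):
    """Return (P[matching triple], P[each nonmatching triple]) for path length t."""
    match = 1.0 - (2.0 / 3.0) * exp(-t)
    return match, (1.0 - match) / 2.0


def table1_triples(x, y):
    """Rooted-triple probabilities for species tree (((AB)C)D).

    Assumes y = branch ancestral to A, B only; x = branch ancestral to A, B, C only.
    """
    # Internal path length separating the species-tree cherry from the outgroup.
    paths = {"ABC": y, "ABD": x + y, "ACD": x, "BCD": x}
    # Cherry displayed by the species tree for each three-taxon subset.
    cherries = {"ABC": "AB", "ABD": "AB", "ACD": "AC", "BCD": "BC"}
    out = {}
    for taxa, t in paths.items():
        match, other = triple_probs(t)
        for outgroup in taxa:
            cherry = taxa.replace(outgroup, "")
            out[f"({cherry}){outgroup}"] = match if cherry == cherries[taxa] else other
    return out


if __name__ == "__main__":
    for x, y in [(.6, .4), (.4, .6), (.8, .3), (.3, .3), (.1, .1)]:
        row = table1_triples(x, y)
        print((x, y), {k: round(row[k], 3) for k in ("(AB)C", "(AB)D", "(AC)D")})
```

Only three distinct path lengths (y, x + y, and x) are needed to fill all twelve rooted-triple rows, which is why several rows of the table repeat the same values.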






Table 2: Probabilities of four-taxon gene trees, clades, and rooted triples as functions of the terms g_ij(T). 
The branch lengths x and y are as in Figure 1A. The probabilities of clades (rooted triples) are obtained 
by adding the probabilities of gene trees that have the clade (rooted triple; see Table 1). For each 
entry in the table, the left and right numbers are the coefficients of the g_ij(T) terms for the species trees 
(((AB)C)D) and ((AB)(CD)), respectively. 
Figure 7: Species tree (((AB)C)D). Probabilities of consensus trees as functions of sample size (number 
of loci). One consensus algorithm is used for each row of plots, and one set of branch lengths is used for 
each column. For the majority-rule and R* algorithms, there are 26 possible four-taxon consensus trees, 
including 15 fully resolved trees and 11 trees not fully resolved. The graphs only show some of the more 
frequently occurring consensus trees; consequently probabilities do not add to 1.0. The legends in the 
left-hand column apply to the three plots in their corresponding rows. 
Figure 8: Species tree ((AB)(CD)). Probabilities of consensus trees as functions of sample size (number 
of loci). One consensus algorithm is used for each row of plots, and one set of branch lengths is used for 
each column. For the majority-rule and R* algorithms, there are 26 possible four-taxon consensus trees, 
including 15 fully resolved trees and 11 trees not fully resolved. The graphs only show some of the more 
frequently occurring consensus trees; consequently probabilities do not add to 1.0. The legends in the 
left-hand column apply to the three plots in their corresponding rows. 
Appendix 1: Majority-Rule Unresolved 
Zones, Species Tree (((AB)C)D) 

In this appendix we derive conditions for which 
the MACT is unresolved for the four-taxon species 
trees (((AB)C)D) and ((AB)(CD)). This is done by 
finding branch lengths for which there exist clades 
with probability greater than 1/2. First, the follow- 
ing result about cherries is useful, which is analogous 
to Proposition 1 and has a similar proof. 

Proposition 10. Let $\sigma$ be the species tree, where 
S is the set of taxa on $\sigma$. Then for any A, B, C $\in$ S, if 
{AB} is a cherry on $\sigma$, then $P_\sigma[\{AB\}] > P_\sigma[\{AC\}]$. 



Proof. The proof is very similar to the proof of 
Lemma 2. 

Remark 11. If {AB} is a cherry on the 
species tree $\sigma$, then for any taxon C, 
$P_\sigma[\{AC\}] = P_\sigma[\{BC\}] < 1/3$. 



The equality holds by symmetry; the inequality follows from Proposition 10.

To find branch lengths for the species tree (((AB)C)D) where the MACT is resolved, consider 
the probabilities of clades {ABC} and {AB}. Table 1 lists the probability that A, B, and C are 
monophyletic as $p_1 + p_3 + p_7$, where $p_i$ is the probability of gene tree i in the same table, 
because for gene trees 1, 3, and 7 (and only these gene trees), these three taxa are 
monophyletic. Table 2 can be used to compute probabilities of gene trees, clades, or rooted 
triples for four-taxon trees as linear combinations of products of the terms $g_{ij}(T)$, which 
denote the probability that i lineages coalesce into j lineages within T coalescent units, 
where $i \geq j \geq 1$ and $T \geq 0$. For i = 2, 3, the $g_{ij}(T)$ functions are (Tavaré, 1984; 
Pamilo and Nei, 1988):

$$
\begin{aligned}
g_{21}(T) &= 1 - e^{-T}, & g_{31}(T) &= 1 - \tfrac{3}{2}e^{-T} + \tfrac{1}{2}e^{-3T}, \\
g_{22}(T) &= e^{-T}, & g_{32}(T) &= \tfrac{3}{2}e^{-T} - \tfrac{3}{2}e^{-3T},
\end{aligned}
\tag{7}
$$

$$
g_{33}(T) = e^{-3T}. \tag{8}
$$

For example, we see from Table 2 that if the species tree is (((AB)C)D), the probability 
of clade {CD} is $\frac{1}{3}g_{21}(y)g_{22}(x) + \frac{1}{9}g_{22}(y)g_{32}(x) + \frac{2}{9}g_{22}(y)g_{33}(x)$; 
and if the species tree is ((AB)(CD)), the probability of clade {CD} is 
$g_{21}(y)g_{21}(x) + \frac{1}{3}g_{21}(y)g_{22}(x) + g_{22}(y)g_{21}(x) + \frac{2}{9}g_{22}(y)g_{22}(x)$.

Using Table 2 and eq. 7, the probability of clade {ABC} is

$$
P_\sigma[\{ABC\}] = p_1 + p_3 + p_7
= 1 - \frac{2}{3}e^{-x} - \frac{1}{3}e^{-(x+y)} + \frac{1}{6}e^{-(3x+y)}. \tag{9}
$$

Setting $P_\sigma[\{ABC\}] > 1/2$, we obtain a condition for which the consensus tree has the 
clade {ABC}. We also note that no other three-taxon clade can be on the MACT because they are 
each incompatible with and less probable than {ABC}, and therefore have probabilities less 
than 1/2. This can be verified by checking their probabilities from Table 2 and comparing 
coefficients of the $g_{ij}(T)$ terms.

Three-taxon clades for the species tree (((AB)C)D) have the probabilities:

$$
\begin{aligned}
P_\sigma(\{ABC\}) &= g_{21}(y)g_{21}(x) + \tfrac{1}{3}g_{21}(y)g_{22}(x) + g_{22}(y)g_{31}(x)
  + \tfrac{1}{3}g_{22}(y)g_{32}(x) + \tfrac{1}{6}g_{22}(y)g_{33}(x), \\
P_\sigma(\{ABD\}) &= \tfrac{1}{3}g_{21}(y)g_{22}(x) + \tfrac{1}{9}g_{22}(y)g_{32}(x)
  + \tfrac{1}{6}g_{22}(y)g_{33}(x), \\
P_\sigma(\{ACD\}) &= P_\sigma(\{BCD\}) = \tfrac{1}{9}g_{22}(y)g_{32}(x) + \tfrac{1}{6}g_{22}(y)g_{33}(x).
\end{aligned}
$$

The grouping {AB} is monophyletic with probability greater than 1/2 if 
$p_1 + p_2 + p_{13} > 1/2$. Again using Table 2 and eq. 7, this occurs when

$$
P_\sigma[\{AB\}] = 1 - \frac{2}{3}e^{-y} - \frac{1}{9}e^{-(3x+y)} \tag{10}
$$

is greater than one-half. Solving for y yields Equation (1). 
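
The following is a minimal numerical sketch of eqs. 7 through 10, assuming x is the branch ancestral to A, B, and C only and y is the branch ancestral to A and B only. `p_abc_from_g` evaluates the g-term expansion of P[{ABC}], while `p_abc` and `p_ab` use the closed forms taken here to be eqs. 9 and 10; checking that the two routes agree is part of the sketch. All function names are hypothetical.

```python
from math import exp, isclose


# Eqs. 7 and 8: probability that i lineages coalesce to j within time t.
def g21(t): return 1.0 - exp(-t)
def g22(t): return exp(-t)
def g31(t): return 1.0 - 1.5 * exp(-t) + 0.5 * exp(-3 * t)
def g32(t): return 1.5 * exp(-t) - 1.5 * exp(-3 * t)
def g33(t): return exp(-3 * t)


def p_abc_from_g(x, y):
    """P[{ABC}] as a linear combination of products of g_ij terms."""
    return (g21(y) * g21(x) + g21(y) * g22(x) / 3 + g22(y) * g31(x)
            + g22(y) * g32(x) / 3 + g22(y) * g33(x) / 6)


def p_abc(x, y):
    """Closed form assumed for eq. 9."""
    return 1 - (2 / 3) * exp(-x) - (1 / 3) * exp(-(x + y)) + (1 / 6) * exp(-(3 * x + y))


def p_ab(x, y):
    """Closed form assumed for eq. 10."""
    return 1 - (2 / 3) * exp(-y) - (1 / 9) * exp(-(3 * x + y))


def mact_clades(x, y):
    """Clades of species tree (((AB)C)D) with probability above 1/2."""
    return [name for name, p in (("AB", p_ab(x, y)), ("ABC", p_abc(x, y))) if p > 0.5]


if __name__ == "__main__":
    for x, y in [(.6, .4), (.4, .6), (.8, .3), (.3, .3), (.1, .1)]:
        assert isclose(p_abc_from_g(x, y), p_abc(x, y))
        print((x, y), round(p_abc(x, y), 3), round(p_ab(x, y), 3), mact_clades(x, y))
```

An empty list from `mact_clades` corresponds to a fully unresolved MACT, and a list with both clades corresponds to a fully resolved MACT that matches the species tree.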

The four trees shown in Figure 2 are the only consensus trees possible regardless of the set 
of branch lengths. To show this, note that Proposition 10 guarantees that all cherries 
incompatible with {AB} (which includes all two-taxon clades other than {AB} and {CD}) are 
less probable than {AB}; they therefore have probabilities lower than 1/2 and cannot be on 
the MACT. To show that {CD} cannot occur on the MACT for this species tree, it must be shown 
that this clade has probability less than one-half. The probability that {CD} is 
monophyletic is

$$
P_\sigma[\{CD\}] = \frac{1}{3}e^{-x} - \frac{1}{6}e^{-(x+y)} + \frac{1}{18}e^{-(3x+y)}
< \frac{1}{3} + \frac{1}{18}e^{-(3x+y)} < \frac{1}{3} + \frac{1}{18} < \frac{1}{2}. \tag{11}
$$
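
The bound in eq. 11 can be checked numerically. The sketch below assumes the closed form P[{CD}] = (1/3)e^{-x} - (1/6)e^{-(x+y)} + (1/18)e^{-(3x+y)} for the species tree (((AB)C)D), with x and y as in Figure 1A. The grid size and upper limit are arbitrary choices made for illustration.

```python
from math import exp


def p_cd(x, y):
    """Assumed closed form of P[{CD}] for species tree (((AB)C)D)."""
    return exp(-x) / 3 - exp(-(x + y)) / 6 + exp(-(3 * x + y)) / 18


def max_on_grid(n=200, upper=5.0):
    """Largest value of p_cd over an (n+1) x (n+1) grid on [0, upper]^2."""
    step = upper / n
    return max(p_cd(i * step, j * step) for i in range(n + 1) for j in range(n + 1))


if __name__ == "__main__":
    print("grid maximum of P[{CD}]:", max_on_grid())
    print("bound 1/3 + 1/18       :", 1 / 3 + 1 / 18)
```

A grid search is not a proof, but it gives a quick sanity check that the clade {CD} stays below 1/2, and so off the MACT, across a wide range of branch lengths.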

Appendix 2: Majority-Rule Unresolved 
Zones, Species Tree ((AB)(CD)) 

Similar calculations as in Appendix 1 can be per- 
formed when the species tree is ((AB)(CD)). For 
this tree, three-taxon groups cannot have probabil- 
ity greater than 1/3. For example, the probability 
for monophyly of {ABC} is (from Table 2 and eq. 7) 

$$
P_\sigma[\{ABC\}] = \frac{1}{3}e^{-x} - \frac{1}{6}e^{-(x+y)} < \frac{1}{3}e^{-x} < \frac{1}{3}. \tag{12}
$$

Thus the MACT for a symmetric four-taxon species 
tree cannot have a clade with three taxa. 

All cherries other than {AB} and {CD} are incom- 
patible with these two cherries (which occur on this 
species tree), and from Remark 11, any two-taxon 
clades other than {AB} and {CD} have probabil- 
ity less than 1/2 and cannot occur on the MACT. 
The two clades that can occur on the MACT have 
probabilities 

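$$
P_\sigma(\{AB\}) = 1 - \frac{2}{3}e^{-y} - \frac{1}{9}e^{-(x+y)}, \quad \text{and} \tag{13}
$$

$$
P_\sigma(\{CD\}) = 1 - \frac{2}{3}e^{-x} - \frac{1}{9}e^{-(x+y)}. \tag{14}
$$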

Setting these functions to be greater than 1/2 yields 
Equations (2) and (3). 

Here the probability that {AB} is a clade cannot be 
greater than 1/2 for y < log(4/3), and the prob- 
ability of clade {CD} cannot be greater than 1/2 
for x < log(4/3). These values form asymptotes on 
the graph of the unresolved zone for the symmetric 
species tree (Fig. [2|3). 
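
A short sketch of the symmetric case, assuming eqs. 13 and 14 with y the branch ancestral to A and B and x the branch ancestral to C and D. The helper `y_threshold` is obtained here by solving P[{AB}] = 1/2 for y, giving y = log(4/3 + (2/9)e^{-x}); it is a derivation made for this example and may be written differently in the main text. All names are illustrative.

```python
from math import exp, log


def p_ab(x, y):
    """Eq. 13: probability of clade {AB} for species tree ((AB)(CD))."""
    return 1 - (2 / 3) * exp(-y) - (1 / 9) * exp(-(x + y))


def p_cd(x, y):
    """Eq. 14: probability of clade {CD} for species tree ((AB)(CD))."""
    return 1 - (2 / 3) * exp(-x) - (1 / 9) * exp(-(x + y))


def y_threshold(x):
    """Value of y at which P[{AB}] equals 1/2 for a given x (derived, see lead-in)."""
    return log(4 / 3 + (2 / 9) * exp(-x))


def mact_clades(x, y):
    """Clades of ((AB)(CD)) that appear on the MACT (probability above 1/2)."""
    return [name for name, p in (("AB", p_ab(x, y)), ("CD", p_cd(x, y))) if p > 0.5]


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 5.0, 50.0):
        print(x, y_threshold(x))        # approaches log(4/3) as x grows
    print("asymptote:", log(4 / 3))
    print(mact_clades(1.0, 1.0), mact_clades(0.1, 1.0), mact_clades(0.2, 0.2))
```

By the symmetry of eqs. 13 and 14, the same threshold function with x and y exchanged gives the boundary for clade {CD}.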

Appendix 3: The Too-Greedy Zone, Species 
Tree (((AB)C)D) 

In this appendix, we show that when the species 
tree has topology (((AB)C)D), finding the branch 
lengths for the too-greedy zone is equivalent to de- 
termining the set of branch lengths for which {CD} 
is more probable than {ABC}. 

For the species tree (((AB)C)D) with any set of 
branch lengths, {ABC} is the most probable three- 
taxon clade, and {AB} is the most probable two- 
taxon clade. These facts can be verified by compar- 
ing clade probabilities in Table 2. 

However, {AB} is not always more probable than 
{ABC}, since the branch ancestral to A 
and B but not C might be very short and the branch 
ancestral to A, B, and C, but not D, might be very 
long. In this case {ABC} has probability near 
1, and {AB} has probability near 1/3. 

To show that when the species tree has topology 
(((AB)C)D), the GACT is always nonmatching if 
and only if {CD} is more probable than {ABC}, we 
consider cases where {ABC} is either (i) more prob- 
able, (ii-iv) less probable, or (v) equally probable as 
{AB}. In (ii-iv), we also consider whether {CD} is 
(ii) less probable, (iii) more probable, or (iv) equally 
probable as {ABC}. Since these cases exhaust all 
possibilities, and greedy consensus returns a non- 
matching tree in case (iii) and with probability 1/2 
in case (iv), we get the desired result. 

(i) P[{ABC}] > P[{AB}]. Here {ABC} is the 
most probable clade other than {ABCD} and is 
therefore included in the GACT. The remaining 
compatible clades are {AB}, {AC} and {BC}. By 
comparing clade probabilities in Table 2, or by using 
Proposition 10, {AB} is the most probable clade of 
these three. Thus the GACT is (((AB)C)D). 

(ii) P[{CD}] < P[{ABC}] < P[{AB}]. In this 
case, {AB} is the most probable clade (other than 
{ABCD}) and is therefore in the GACT. The re- 
maining compatible clades are {CD}, {ABC}, and 
{ABD}. Since P[{ABD}] < P[{ABC}] (Table 2), 
{ABD} cannot be on the GACT, thus the GACT is 
(((AB)C)D). 

(iii) P[{ABC}] < P[{CD}] < P[{AB}]. In this 
case the GACT is ((AB)(CD)). Because P[{CD}] < 
P[{AB}] for any branch lengths (Table 2), the condition 
P[{ABC}] < P[{CD}] alone is sufficient for the GACT 
to be ((AB)(CD)). 

(iv) P[{ABC}] = P[{CD}] < P[{AB}]. This equal- 
ity holds only when eq. 5 is an equality, that is, for 
points on the boundary of the too-greedy zone. In 
this case the GACT is ((AB)(CD)) or (((AB)C)D), 
each with probability 1/2. 

(v) Finally, if P[{ABC}] = P[{AB}], then the 
GACT is (((AB)C)D) since in this case these are 
the two most probable clades. 

Having considered all cases, P[{ABC}] < 
P[{CD}] is necessary and sufficient for ((AB)(CD)) 
to be the GACT with probability 1, and 
P[{ABC}] = P[{CD}] is necessary and sufficient for 
((AB)(CD)) to be the GACT with probability 1/2. 
The probabilities of {ABC} and {CD} are given in 
eqs. 9 and 11, respectively, in Appendix 1. Setting 
P[{CD}] > P[{ABC}] and solving for y yields eq. 5. 
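
The case analysis above can be mimicked with a small greedy procedure over clade probabilities. This is a sketch under stated assumptions: the closed-form clade probabilities in `clade_probs` are obtained by expanding the g-term expressions for the species tree (((AB)C)D) (x ancestral to A, B, C; y ancestral to A, B), exact ties are broken by dictionary order rather than at random as in case (iv), and all names are illustrative.

```python
from math import exp


def clade_probs(x, y):
    """Assumed closed-form clade probabilities for species tree (((AB)C)D)."""
    a, d = exp(-x), exp(-y)
    b, c = exp(-(x + y)), exp(-(3 * x + y))
    return {
        frozenset("AB"): 1 - 2 * d / 3 - c / 9,
        frozenset("AC"): d / 3 - c / 9,
        frozenset("BC"): d / 3 - c / 9,
        frozenset("AD"): b / 6 + c / 18,
        frozenset("BD"): b / 6 + c / 18,
        frozenset("CD"): a / 3 - b / 6 + c / 18,
        frozenset("ABC"): 1 - 2 * a / 3 - b / 3 + c / 6,
        frozenset("ABD"): a / 3 - b / 6,
        frozenset("ACD"): b / 6,
        frozenset("BCD"): b / 6,
    }


def compatible(c1, c2):
    """Two clades are compatible if they are nested or disjoint."""
    return c1 <= c2 or c2 <= c1 or not (c1 & c2)


def greedy_consensus(probs):
    """Add clades in decreasing order of probability, skipping incompatible ones."""
    chosen = []
    for clade, _ in sorted(probs.items(), key=lambda kv: -kv[1]):
        if all(compatible(clade, other) for other in chosen):
            chosen.append(clade)
    return set(chosen)


if __name__ == "__main__":
    for x, y in [(0.05, 0.05), (0.1, 0.1), (0.6, 0.4)]:
        p = clade_probs(x, y)
        tree = sorted("".join(sorted(c)) for c in greedy_consensus(p))
        too_greedy = p[frozenset("CD")] > p[frozenset("ABC")]
        print((x, y), tree, "P[CD] > P[ABC]:", too_greedy)
```

For four taxa the loop always ends with two nontrivial clades, so the result is either {AB}, {ABC} (the species tree) or {AB}, {CD} (the nonmatching tree).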

Appendix 4: The Too-Greedy Zone, Species 
Tree ((AB)(CD)) 

We now show that if the species tree has topology 
((AB)(CD)), then the GACT matches the species 
tree. First note that for this species tree, {AB} 
and {CD} are always each more probable than any 
three-taxon clade. This can be verified by comparing coefficients of the $g_{ij}$ terms in 
the clade probabilities from Table 2 and by noting that $g_{ij}(T) > 0$ for $T > 0$:

$$
\begin{aligned}
P_\sigma(\{AB\}) &= g_{21}(y)g_{21}(x) + g_{21}(y)g_{22}(x) + \tfrac{1}{3}g_{22}(y)g_{21}(x)
  + \tfrac{2}{9}g_{22}(y)g_{22}(x), \\
P_\sigma(\{CD\}) &= g_{21}(y)g_{21}(x) + \tfrac{1}{3}g_{21}(y)g_{22}(x) + g_{22}(y)g_{21}(x)
  + \tfrac{2}{9}g_{22}(y)g_{22}(x), \\
P_\sigma(\{ABC\}) &= P_\sigma(\{ABD\}) = \tfrac{1}{3}g_{21}(y)g_{22}(x) + \tfrac{1}{6}g_{22}(y)g_{22}(x), \\
P_\sigma(\{ACD\}) &= P_\sigma(\{BCD\}) = \tfrac{1}{3}g_{22}(y)g_{21}(x) + \tfrac{1}{6}g_{22}(y)g_{22}(x).
\end{aligned}
$$

Also, from Proposition 10, {AB} is more probable 
than any two-taxon clade other than {CD}, and {CD} is 
more probable than any two-taxon clade other than 
{AB}. From this it follows that the first clade cho- 
sen in the greedy algorithm (other than {ABCD}) is 
either {AB} or {CD}, since any other clade would be 
less probable than one of these two. If {AB} is most 
probable, the remaining compatible clades are {CD}, 
{ABC}, and {ABD}. However, since {CD} is al- 
ways more probable than {ABC} and {ABD}, {CD} 
would be chosen after {AB}. Similarly, if {CD} is 
chosen first, {AB} is more probable than the remaining 
compatible clades and so is chosen second. Thus the GACT 
is always ((AB)(CD)) for this species tree. 
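
The conclusion of this appendix can be spot-checked on a grid of branch lengths. The sketch assumes closed-form clade probabilities for the species tree ((AB)(CD)) obtained by expanding the g-term expressions (y ancestral to A and B, x ancestral to C and D); the grid bounds and step are arbitrary, and all names are illustrative.

```python
from math import exp


def clade_probs_symmetric(x, y):
    """Assumed closed-form clade probabilities for species tree ((AB)(CD))."""
    a, d, b = exp(-x), exp(-y), exp(-(x + y))
    probs = {
        frozenset("AB"): 1 - 2 * d / 3 - b / 9,
        frozenset("CD"): 1 - 2 * a / 3 - b / 9,
        frozenset("ABC"): a / 3 - b / 6,
        frozenset("ABD"): a / 3 - b / 6,
        frozenset("ACD"): d / 3 - b / 6,
        frozenset("BCD"): d / 3 - b / 6,
    }
    for cherry in ("AC", "AD", "BC", "BD"):
        probs[frozenset(cherry)] = 2 * b / 9
    return probs


def compatible(c1, c2):
    """Two clades are compatible if they are nested or disjoint."""
    return c1 <= c2 or c2 <= c1 or not (c1 & c2)


def greedy_consensus(probs):
    """Add clades in decreasing order of probability, skipping incompatible ones."""
    chosen = []
    for clade, _ in sorted(probs.items(), key=lambda kv: -kv[1]):
        if all(compatible(clade, other) for other in chosen):
            chosen.append(clade)
    return set(chosen)


def always_matches(n=60, step=0.05):
    """True if the greedy tree is ((AB)(CD)) at every grid point with x, y > 0."""
    target = {frozenset("AB"), frozenset("CD")}
    return all(
        greedy_consensus(clade_probs_symmetric(i * step, j * step)) == target
        for i in range(1, n + 1) for j in range(1, n + 1)
    )


if __name__ == "__main__":
    print("greedy consensus matches ((AB)(CD)) on the whole grid:", always_matches())
```

Unlike the asymmetric case, no grid point here should produce a nonmatching tree, in line with the absence of a too-greedy zone for ((AB)(CD)).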